Data Mining 03 - Review Exploratory Data Analysis for Data Mining

Outline:

  • Introduction to EDA
  • Importing/Loading Data
  • Basic Data Preparation (data types, duplicates, variable selection)
  • Noise vs Outliers
  • Missing Values and Imputation
  • Basic Statistics
  • Exporting Data
  • Visualizations
  • Interpretation and recommendations

Introduction:

  • Exploratory Data Analysis (EDA) is the soul of every data analysis process. The ability to perform EDA well is a basic requirement for every profession that involves data processing, whether business intelligence, data analyst, data scientist, or others. EDA is also the first stage of most data analysis workflows, and it largely determines how good the subsequent analysis will be.

  • Introduced by John Tukey, 1961: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

  • The components of EDA include preprocessing, computing basic statistics (e.g. measures of central tendency and dispersion), visualization, forming hypotheses (initial conjectures), checking assumptions, through to storytelling and reporting. It also covers handling missing values and outliers, dimensionality reduction, grouping, transformation, and examining data distributions.

  • Tools: Python, R, S-Plus, etc.
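
As a small illustration of the "basic statistics" component mentioned above, the sketch below computes measures of central tendency and dispersion with pandas. The income values are hypothetical and chosen only so that one extreme value is present; they do not come from the course data.

```python
import pandas as pd

# Hypothetical sample (illustrative values only, not from the course dataset)
income = pd.Series([4.5, 5.0, 5.2, 6.1, 6.3, 7.0, 7.4, 8.8, 9.5, 60.0], name="income")

# Measures of central tendency
mean_ = income.mean()
median_ = income.median()

# Measures of dispersion
std_ = income.std()                      # sample standard deviation (ddof=1)
q1, q3 = income.quantile([0.25, 0.75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range

print(f"mean={mean_:.2f}, median={median_:.2f}, std={std_:.2f}, IQR={iqr:.2f}")
```

The single extreme value pulls the mean well above the median, which is one reason EDA usually reports both.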

Goals of EDA

  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical techniques
  • Provide a basis for further data collection

Data(set)

  • A collection of data entities/objects and their attributes
  • An attribute is a property or characteristic of an object
  • Example for a human object: age, weight, height, sex, and so on.
  • Each attribute has several possible "states", for example: male/female.
  • A collection of attributes defines an object.
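
The object/attribute view above maps naturally onto a pandas DataFrame: one row per object, one column per attribute. The following is a minimal sketch; the people and their values are hypothetical.

```python
import pandas as pd

# Hypothetical objects (people); every value here is made up for illustration
people = pd.DataFrame({
    "age": [23, 35, 41],
    "weight_kg": [58.0, 72.5, 80.1],
    "height_cm": [160, 175, 168],
    "sex": pd.Categorical(["female", "male", "male"], categories=["female", "male"]),
})

print(people.shape)                   # (number of objects, number of attributes)
print(people.dtypes)                  # the data type of each attribute
print(people["sex"].cat.categories)   # the possible states of a categorical attribute
print(people.loc[0].to_dict())        # the attribute values that describe one object
```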

In the field, the data we obtain rarely arrives neat and clean. Often it is very messy, and extra effort is needed to prepare it for analysis.

Image source: https://miro.medium.com/max/1869/0*1-i9w0e4kklVQl5B.jpg

Preprocessing

  • The key to obtaining a valid & reliable model.
  • Different preprocessing can lead to different conclusions/insights.
  • Different models may also require different preprocessing.

Some Basic Processes

  • Variable selection and "join"
  • Data cleaning: duplicates, noise, and outliers
  • Data transformation
  • Dimensionality reduction
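
The first process in the list, variable selection and "join", could look like the sketch below. The two tables, the key `house_id`, and all values are hypothetical and only loosely modelled on the house price data used later.

```python
import pandas as pd

# Hypothetical tables sharing the key "house_id" (all values are illustrative)
houses = pd.DataFrame({
    "house_id": [1, 2, 3],
    "carpet": [1659, 1461, 1340],
    "note": ["a", "b", "c"],          # a column we decide not to analyse
})
cities = pd.DataFrame({
    "house_id": [1, 2, 4],
    "city_category": ["CAT B", "CAT B", "CAT A"],
})

# Variable selection: keep only the columns needed for the analysis
selected = houses[["house_id", "carpet"]]

# Join: a left join keeps every house and leaves NaN where no city record matches
joined = selected.merge(cities, on="house_id", how="left")
print(joined)
```

A left join is used here so that no house is silently dropped; the resulting NaN makes the unmatched record visible for later cleaning.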

Data Understanding: Relevance

  • What data is available?
  • How much data is available (and covering how long a period)?
  • Is any of it labeled? (target variable)
  • Is the data relevant? Or can it be made relevant?
  • What is the quality of the data?
  • Is there additional (external) data?
  • Who in the company understands this data well?
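
Some of the questions above (how much data, is there a target, how good is the quality) can be partly answered with a quick first look at the table. This is only a sketch on a hypothetical DataFrame; treating `house_price` as the target variable is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few deliberate gaps
df = pd.DataFrame({
    "builtup": [1961.0, 1752.0, np.nan, 1748.0],
    "parking": ["Open", "Not Provided", "Covered", None],
    "house_price": [6649000, 3982000, 5401000, 5373000],  # assumed target variable
})

n_rows, n_cols = df.shape               # how much data is available
missing_per_column = df.isna().sum()    # a first indicator of data quality
n_unique = df.nunique()                 # distinct values per column (NaN excluded)

df.info()                               # column types and non-null counts
print(missing_per_column)
print(n_unique)
```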

Why is preprocessing needed?

  • Real-world data is usually not as clean/tidy as the data in textbooks.
    • Noise: e.g. a negative salary
    • Outliers: e.g. someone with an income above 500 million/month.
    • Duplicates: common in social media data
    • Encodings, etc.: common in Big Data, because of how the data is stored/joined.
  • Incomplete: only aggregates, missing important variables, etc.
  • Analysis of data that has not been preprocessed usually yields inaccurate or less accurate insights.
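
The difference between noise, duplicates, and outliers listed above can be made concrete with a small sketch. The salary table is hypothetical, and the 1.5 × IQR rule used to flag outliers is one common convention chosen for this example, not a rule prescribed by this notebook.

```python
import pandas as pd

# Hypothetical monthly salaries (illustrative values only)
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "E", "F"],
    "salary": [5.0, 6.5, -4.0, 7.2, 8.0, 8.0, 650.0],
})

# Noise: values that are impossible in the domain (a salary cannot be negative)
noise = df[df["salary"] < 0]

# Duplicates: rows that repeat an earlier row exactly
duplicates = df[df.duplicated()]

# Outliers: possible but extreme values, flagged here with the 1.5 * IQR rule
clean = df[df["salary"] >= 0].drop_duplicates()
q1, q3 = clean["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["salary"] < q1 - 1.5 * iqr) | (clean["salary"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]

print(noise, duplicates, outliers, sep="\n\n")
```

Noise and duplicates are removed before the quartiles are computed so that invalid rows do not distort the outlier thresholds.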

Garbage in, garbage out

Main steps:

  1. Data gathering:
    • Data warehouse, database, web crawling/scraping/streaming.
    • Identification, extraction, and integration of data
  2. Data cleaning
  3. Data transformation (e.g. encoding categorical variables)
  4. Normalization/standardization
  5. Data reduction:
    • Variable selection (domain knowledge/automatic)
    • Feature engineering
    • Variable reduction

Image source: https://machinelearningmastery.com/framework-for-data-preparation-for-machine-learning/?fbclid=IwAR2KFDHPYPQ-Xw0A_UxZbRLXt_EdxFKiHpBvmxoNKThVMfGUM0MJWGbC20k
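
Two of the steps above, encoding a categorical variable and normalization/standardization, can be sketched as follows. One-hot encoding, min-max scaling, and z-scores are assumed here as typical choices. The column names echo the house price data but the rows are illustrative.

```python
import pandas as pd

# Illustrative rows; column names are assumptions for this example
df = pd.DataFrame({
    "parking": ["Open", "Not Provided", "Covered", "Open"],
    "carpet": [1659.0, 1461.0, 1340.0, 1451.0],
})

# Transformation: one-hot encode the categorical variable
encoded = pd.get_dummies(df, columns=["parking"], prefix="parking")

carpet = df["carpet"]

# Normalization: rescale to the range [0, 1]
encoded["carpet_minmax"] = (carpet - carpet.min()) / (carpet.max() - carpet.min())

# Standardization: zero mean and unit (population) standard deviation
encoded["carpet_z"] = (carpet - carpet.mean()) / carpet.std(ddof=0)

print(encoded)
```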

Importing/Loading CSV and Excel Data via Pandas

  • Importing CSV file https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
  • Importing Excel file https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
  • Encodings https://docs.python.org/3/library/codecs.html#standard-encodings
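
The encoding link matters because a file saved in one encoding may fail to load in another. The sketch below shows one possible fallback pattern on an in-memory file. The sample content and the choice of Latin-1 as the fallback are assumptions for illustration, not a general recommendation.

```python
import io
import pandas as pd

# A tiny CSV encoded as Latin-1; the byte used for "é" is not valid UTF-8
raw = "name,city\nRené,Bandung\n".encode("latin-1")

try:
    # First attempt: the encoding we expect
    df_enc = pd.read_csv(io.BytesIO(raw), encoding="utf8")
except UnicodeDecodeError:
    # Fallback: retry with another standard encoding
    df_enc = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print(df_enc)
```

In practice it is better to find out the actual encoding from whoever produced the file, since a wrong fallback can load without error and still garble the text.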
In [1]:
import warnings; warnings.simplefilter('ignore')

try:
    import google.colab; IN_COLAB = True
    print("Installing the required modules")
    !pip install lxml folium
    !mkdir data images output
    #!wget -P data/ https://raw.githubusercontent.com/taudataanalytics/eLearning/master/data/price.csv
except ImportError:
    IN_COLAB = False
    print("Running the code locally, please make sure all of the python module versions agree with colab environment and all data/assets downloaded")
Running the code locally, please make sure all of the python module versions agree with colab environment and all data/assets downloaded
In [2]:
import pandas as pd

file_ = 'data/price.csv'
try: # Running locally: make sure "file_" is inside the "data" folder
    price = pd.read_csv(file_, low_memory = False, encoding='utf8')
except FileNotFoundError: # Running in Google Colab: download the file first
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudataanalytics/Data-Mining--Penambangan-Data--Ganjil-2024/master/data/price.csv
    price = pd.read_csv(file_, low_memory = False, encoding='utf8')

N, P = price.shape # Size of the data
print('rows = ', N, ', columns (number of variables) = ', P)
print("Type of variable price = ", type(price))
price
rows =  936 , columns (number of variables) =  10
Type of variable price =  <class 'pandas.core.frame.DataFrame'>
Out[2]:
Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
0 1 9796.0 5250.0 10703.0 1659.0 1961.0 Open CAT B 530 6649000
1 2 8294.0 8186.0 12694.0 1461.0 1752.0 Not Provided CAT B 210 3982000
2 3 11001.0 14399.0 16991.0 1340.0 1609.0 Not Provided CAT A 720 5401000
3 4 8301.0 11188.0 12289.0 1451.0 1748.0 Covered CAT B 620 5373000
4 5 10510.0 12629.0 13921.0 1770.0 2111.0 Not Provided CAT B 450 4662000
... ... ... ... ... ... ... ... ... ... ...
931 932 9297.0 12537.0 14418.0 1174.0 1429.0 Covered CAT C 1110 5434000
932 933 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
933 934 9205.0 10418.0 14496.0 1118.0 1337.0 Open CAT A 560 7227000
934 935 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
935 936 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000

936 rows × 10 columns

What about Excel files?

Because of deprecated support, the "openpyxl" module must be installed first

  • Importing Excel file https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
  • openpyxl https://openpyxl.readthedocs.io/en/stable/
In [3]:
# If you run this Jupyter notebook locally, some adjustment is needed
try:
    import google.colab; IN_COLAB = True
    !pip install openpyxl
except ImportError:
    print('If you have not already, please install the openpyxl module from your environment terminal (recommended).') #IN_COLAB = False
If you have not already, please install the openpyxl module from your environment terminal (recommended).
In [4]:
file_ = 'data/price.xlsx'
try: # Running locally
    xl = pd.ExcelFile(file_, engine = 'openpyxl')
except FileNotFoundError: # Running in Google Colab: download the file first
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/{file_}
    xl = pd.ExcelFile(file_, engine = 'openpyxl')

sheets_ = xl.sheet_names
print(sheets_)
price = xl.parse(sheets_[0], header=0) # make a habit of not hard-coding the sheet name

N, P = price.shape # Size of the data
print('rows = ', N, ', columns (number of variables) = ', P)
print("Type of variable price = ", type(price))
price
['price1', 'price2']
rows =  936 , columns (number of variables) =  10
Type of variable price =  <class 'pandas.core.frame.DataFrame'>
Out[4]:
Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
0 1 9796.0 5250.0 10703.0 1659.0 1961.0 Open CAT B 530 6649000
1 2 8294.0 8186.0 12694.0 1461.0 1752.0 Not Provided CAT B 210 3982000
2 3 11001.0 14399.0 16991.0 1340.0 1609.0 Not Provided CAT A 720 5401000
3 4 8301.0 11188.0 12289.0 1451.0 1748.0 Covered CAT B 620 5373000
4 5 10510.0 12629.0 13921.0 1770.0 2111.0 Not Provided CAT B 450 4662000
... ... ... ... ... ... ... ... ... ... ...
931 932 9297.0 12537.0 14418.0 1174.0 1429.0 Covered CAT C 1110 5434000
932 933 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
933 934 9205.0 10418.0 14496.0 1118.0 1337.0 Open CAT A 560 7227000
934 935 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
935 936 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000

936 rows × 10 columns

In [5]:
df = pd.read_excel(file_, sheet_name='price1')
df.head()
Out[5]:
Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
0 1 9796.0 5250.0 10703.0 1659.0 1961.0 Open CAT B 530 6649000
1 2 8294.0 8186.0 12694.0 1461.0 1752.0 Not Provided CAT B 210 3982000
2 3 11001.0 14399.0 16991.0 1340.0 1609.0 Not Provided CAT A 720 5401000
3 4 8301.0 11188.0 12289.0 1451.0 1748.0 Covered CAT B 620 5373000
4 5 10510.0 12629.0 13921.0 1770.0 2111.0 Not Provided CAT B 450 4662000

Is XLS or CSV preferred in Data Science/Machine Learning ... Why?

Importing/Loading MySQL Data via Pandas

  • https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html
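
A minimal sketch of `pd.read_sql` follows. To keep it runnable without a database server, it substitutes an in-memory SQLite database for MySQL. The table name, columns, and rows are hypothetical. With MySQL, the connection object would typically be a SQLAlchemy engine instead, and the placeholder style in the query depends on the driver.

```python
import sqlite3
import pandas as pd

# Stand-in for a real database: an in-memory SQLite database (not MySQL)
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE price (observation INTEGER, city_category TEXT, house_price INTEGER)"
)
con.executemany(
    "INSERT INTO price VALUES (?, ?, ?)",
    [(1, "CAT B", 6649000), (2, "CAT B", 3982000), (3, "CAT A", 5401000)],
)
con.commit()

# Parameterised query: the value is passed separately instead of being pasted into the SQL
query = (
    "SELECT city_category, house_price FROM price "
    "WHERE house_price > ? ORDER BY observation"
)
df_sql = pd.read_sql(query, con, params=(4_000_000,))
con.close()

print(df_sql)
```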

Web API

  • https://panda.readthedocs.io/en/latest/api_tutorial.html
  • https://towardsdatascience.com/scraping-tabular-data-with-pandas-python-10cf2a133cbf
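
`pd.read_html` returns a list of DataFrames, one per HTML table found on the page, so a usual next step is to pick the table of interest and clean its columns. The sketch below uses a hand-built list as a stand-in for that return value so it runs offline. The names and amounts are placeholders, and selecting by column name is just one possible strategy.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html(url); contents are placeholders
df_list = [
    pd.DataFrame({"Icon": [None], "Description": ["Has not changed from the previous ranking."]}),
    pd.DataFrame({
        "No.": [1, 2],
        "Name": ["Person A", "Person B"],
        "Net worth (USD)": ["$233 billion", "$94.5 billion"],
    }),
]

# Pick the first table that contains the column we need
wanted = "Net worth (USD)"
table = next(df for df in df_list if wanted in df.columns)

# Strip everything except digits and the decimal point, then convert to float
table = table.assign(
    net_worth_billion=table[wanted].str.replace(r"[^0-9.]", "", regex=True).astype(float)
)
print(table)
```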
In [6]:
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df_list = pd.read_html(url) # Careful: this is a list!
print(len(df_list))
df_list
48
Out[6]:
[                                                    0  \
 0   List of the world's billionaires, ranked in or...   
 1   The net worth of the world's billionaires incr...   
 2                                 Publication details   
 3                                           Publisher   
 4                                         Publication   
 5                                     First published   
 6                                  Latest publication   
 7                      Current list details (2024)[2]   
 8                                          Wealthiest   
 9                                     Net worth (1st)   
 10                             Number of billionaires   
 11                         Total list net worth value   
 12                                    Number of women   
 13                                      Number of men   
 14                            New members to the list   
 15           Forbes: The World's Billionaires website   
 
                                                     1  
 0   List of the world's billionaires, ranked in or...  
 1   The net worth of the world's billionaires incr...  
 2                                 Publication details  
 3                Whale Media InvestmentsForbes family  
 4                                              Forbes  
 5                                       March 1987[1]  
 6                                       April 2, 2024  
 7                      Current list details (2024)[2]  
 8                                     Bernard Arnault  
 9                                      US$233 billion  
 10                                  2,781 (from 2640)  
 11           US$14.2 trillion (from US$12.2 trillion)  
 12                                                383  
 13                                               2398  
 14                                                141  
 15           Forbes: The World's Billionaires website  ,
    Icon                                 Description
 0   NaN  Has not changed from the previous ranking.
 1   NaN    Has increased from the previous ranking.
 2   NaN    Has decreased from the previous ranking.,
    No.                      Name Net worth (USD)  Age  \
 0    1  Bernard Arnault & family    $233 billion   75   
 1    2                 Elon Musk    $195 billion   52   
 2    3                Jeff Bezos    $194 billion   60   
 3    4           Mark Zuckerberg    $177 billion   39   
 4    5             Larry Ellison    $141 billion   79   
 5    6            Warren Buffett    $133 billion   93   
 6    7                Bill Gates    $128 billion   68   
 7    8             Steve Ballmer    $121 billion   68   
 8    9             Mukesh Ambani    $116 billion   66   
 9   10                Larry Page    $114 billion   51   
 
                            Nationality Primary source(s) of wealth  
 0                               France                        LVMH  
 1  South Africa  Canada  United States               Tesla, SpaceX  
 2                        United States                      Amazon  
 3                        United States              Meta Platforms  
 4                        United States          Oracle Corporation  
 5                        United States          Berkshire Hathaway  
 6                        United States                   Microsoft  
 7                        United States                   Microsoft  
 8                                India         Reliance Industries  
 9                        United States                      Google  ,
    No.                      Name Net worth (USD)  Age  \
 0    1  Bernard Arnault & family    $211 billion   74   
 1    2                 Elon Musk    $180 billion   51   
 2    3                Jeff Bezos    $114 billion   59   
 3    4             Larry Ellison    $107 billion   78   
 4    5            Warren Buffett    $106 billion   92   
 5    6                Bill Gates    $104 billion   67   
 6    7         Michael Bloomberg   $94.5 billion   81   
 7    8      Carlos Slim & family     $93 billion   83   
 8    9             Mukesh Ambani   $83.4 billion   65   
 9   10             Steve Ballmer   $80.7 billion   67   
 
                            Nationality         Primary source(s) of wealth  
 0                               France                                LVMH  
 1  South Africa  Canada  United States                       Tesla, SpaceX  
 2                        United States                              Amazon  
 3                        United States                  Oracle Corporation  
 4                        United States                  Berkshire Hathaway  
 5                        United States                           Microsoft  
 6                        United States                      Bloomberg L.P.  
 7                               Mexico  Telmex, América Móvil, Grupo Carso  
 8                                India                 Reliance Industries  
 9                        United States                           Microsoft  ,
    No.                      Name Net worth (USD)  Age  \
 0    1                 Elon Musk    $219 billion   50   
 1    2                Jeff Bezos    $177 billion   58   
 2    3  Bernard Arnault & family    $158 billion   73   
 3    4                Bill Gates    $129 billion   66   
 4    5            Warren Buffett    $118 billion   91   
 5    6                Larry Page    $111 billion   49   
 6    7               Sergey Brin    $107 billion   48   
 7    8             Larry Ellison    $106 billion   77   
 8    9             Steve Ballmer   $91.4 billion   66   
 9   10             Mukesh Ambani   $90.7 billion   64   
 
                            Nationality Primary source(s) of wealth  
 0  South Africa  Canada  United States               Tesla, SpaceX  
 1                        United States                      Amazon  
 2                               France                        LVMH  
 3                        United States                   Microsoft  
 4                        United States          Berkshire Hathaway  
 5                        United States                      Google  
 6                        United States                      Google  
 7                        United States          Oracle Corporation  
 8                        United States                   Microsoft  
 9                                India         Reliance Industries  ,
    No.                      Name Net worth (USD)  Age  \
 0    1                Jeff Bezos    $177 billion   57   
 1    2                 Elon Musk    $151 billion   49   
 2    3  Bernard Arnault & family    $150 billion   72   
 3    4                Bill Gates    $124 billion   65   
 4    5           Mark Zuckerberg     $97 billion   36   
 5    6            Warren Buffett     $96 billion   90   
 6    7             Larry Ellison     $93 billion   76   
 7    8                Larry Page   $91.5 billion   48   
 8    9               Sergey Brin     $89 billion   47   
 9   10             Mukesh Ambani   $84.5 billion   63   
 
                            Nationality  Source(s) of wealth  
 0                        United States               Amazon  
 1  South Africa  Canada  United States        Tesla, SpaceX  
 2                               France                 LVMH  
 3                        United States            Microsoft  
 4                        United States       Meta Platforms  
 5                        United States   Berkshire Hathaway  
 6                        United States   Oracle Corporation  
 7                        United States               Google  
 8                        United States               Google  
 9                                India  Reliance Industries  ,
    No.                      Name Net worth (USD)  Age    Nationality  \
 0    1                Jeff Bezos    $113 billion   56  United States   
 1    2                Bill Gates     $98 billion   64  United States   
 2    3  Bernard Arnault & family     $76 billion   71         France   
 3    4            Warren Buffett   $67.5 billion   89  United States   
 4    5             Larry Ellison     $59 billion   75  United States   
 5    6            Amancio Ortega   $55.1 billion   84          Spain   
 6    7           Mark Zuckerberg   $54.7 billion   35  United States   
 7    8                Jim Walton   $54.6 billion   71  United States   
 8    9              Alice Walton   $54.4 billion   70  United States   
 9   10          S. Robson Walton   $54.1 billion   77  United States   
 
   Source(s) of wealth  
 0              Amazon  
 1           Microsoft  
 2                LVMH  
 3  Berkshire Hathaway  
 4  Oracle Corporation  
 5       Inditex, Zara  
 6      Facebook, Inc.  
 7             Walmart  
 8             Walmart  
 9             Walmart  ,
    No.               Name Net worth (USD)  Age    Nationality  \
 0    1         Jeff Bezos    $131 billion   55  United States   
 1    2         Bill Gates   $96.5 billion   63  United States   
 2    3     Warren Buffett   $82.5 billion   88  United States   
 3    4    Bernard Arnault     $76 billion   70         France   
 4    5        Carlos Slim     $64 billion   79         Mexico   
 5    6     Amancio Ortega   $62.7 billion   82          Spain   
 6    7      Larry Ellison   $62.5 billion   74  United States   
 7    8    Mark Zuckerberg   $62.3 billion   34  United States   
 8    9  Michael Bloomberg   $55.5 billion   77  United States   
 9   10         Larry Page   $50.8 billion   45  United States   
 
           Source(s) of wealth  
 0                      Amazon  
 1                   Microsoft  
 2          Berkshire Hathaway  
 3                        LVMH  
 4  América Móvil, Grupo Carso  
 5               Inditex, Zara  
 6          Oracle Corporation  
 7              Facebook, Inc.  
 8              Bloomberg L.P.  
 9                      Google  ,
    No.             Name Net worth (USD)  Age    Nationality  \
 0    1       Jeff Bezos    $112 billion   54  United States   
 1    2       Bill Gates     $90 billion   62  United States   
 2    3   Warren Buffett     $84 billion   87  United States   
 3    4  Bernard Arnault     $72 billion   69         France   
 4    5  Mark Zuckerberg     $71 billion   33  United States   
 5    6   Amancio Ortega     $70 billion   81          Spain   
 6    7      Carlos Slim   $67.1 billion   78         Mexico   
 7    8     Charles Koch     $60 billion   82  United States   
 8    8       David Koch     $60 billion   77  United States   
 9   10    Larry Ellison   $58.5 billion   73  United States   
 
           Source(s) of wealth  
 0                      Amazon  
 1                   Microsoft  
 2          Berkshire Hathaway  
 3                        LVMH  
 4              Facebook, Inc.  
 5               Inditex, Zara  
 6  América Móvil, Grupo Carso  
 7             Koch Industries  
 8             Koch Industries  
 9          Oracle Corporation  ,
    No.               Name Net worth (USD)  Age    Nationality  \
 0    1         Bill Gates   $86.0 billion   61  United States   
 1    2     Warren Buffett   $75.6 billion   86  United States   
 2    3         Jeff Bezos   $72.8 billion   53  United States   
 3    4     Amancio Ortega   $71.3 billion   80          Spain   
 4    5    Mark Zuckerberg   $56.0 billion   32  United States   
 5    6        Carlos Slim   $54.5 billion   77         Mexico   
 6    7      Larry Ellison   $52.2 billion   72  United States   
 7    8       Charles Koch   $48.3 billion   81  United States   
 8    8         David Koch   $48.3 billion   76  United States   
 9   10  Michael Bloomberg   $47.5 billion   75  United States   
 
           Source(s) of wealth  
 0                   Microsoft  
 1          Berkshire Hathaway  
 2                      Amazon  
 3               Inditex, Zara  
 4              Facebook, Inc.  
 5  América Móvil, Grupo Carso  
 6          Oracle Corporation  
 7             Koch Industries  
 8             Koch Industries  
 9              Bloomberg L.P.  ,
    No.               Name Net worth (USD)  Age    Nationality  \
 0    1         Bill Gates   $75.0 billion   60  United States   
 1    2     Amancio Ortega   $67.0 billion   79          Spain   
 2    3     Warren Buffett   $60.8 billion   85  United States   
 3    4        Carlos Slim   $50.0 billion   76         Mexico   
 4    5         Jeff Bezos   $45.2 billion   52  United States   
 5    6    Mark Zuckerberg   $44.6 billion   31  United States   
 6    7      Larry Ellison   $43.6 billion   71  United States   
 7    8  Michael Bloomberg   $40.0 billion   74  United States   
 8    9       Charles Koch   $39.6 billion   80  United States   
 9    9         David Koch   $39.6 billion   75  United States   
 
           Source(s) of wealth  
 0                   Microsoft  
 1                     Inditex  
 2          Berkshire Hathaway  
 3  América Móvil, Grupo Carso  
 4                      Amazon  
 5              Facebook, Inc.  
 6          Oracle Corporation  
 7              Bloomberg L.P.  
 8             Koch Industries  
 9             Koch Industries  ,
    No.                 Name Net worth (USD)  Age    Nationality  \
 0    1           Bill Gates   $79.2 billion   59  United States   
 1    2          Carlos Slim   $77.1 billion   75         Mexico   
 2    3       Warren Buffett   $72.7 billion   84  United States   
 3    4       Amancio Ortega   $64.5 billion   78          Spain   
 4    5        Larry Ellison   $54.3 billion   70  United States   
 5    6         Charles Koch   $42.9 billion   79  United States   
 6    6           David Koch   $42.9 billion   74  United States   
 7    8       Christy Walton   $41.7 billion   66  United States   
 8    9           Jim Walton   $40.6 billion   66  United States   
 9   10  Liliane Bettencourt   $40.1 billion   92         France   
 
           Source(s) of wealth  
 0                   Microsoft  
 1  América Móvil, Grupo Carso  
 2          Berkshire Hathaway  
 3                     Inditex  
 4          Oracle Corporation  
 5             Koch Industries  
 6             Koch Industries  
 7                     Walmart  
 8                     Walmart  
 9                     L'Oreal  ,
    No.                     Name Net worth (USD)  Age    Nationality  \
 0    1               Bill Gates   $76.0 billion   58  United States   
 1    2     Carlos Slim & family   $72.0 billion   74         Mexico   
 2    3           Amancio Ortega   $64.0 billion   77          Spain   
 3    4           Warren Buffett   $58.2 billion   83  United States   
 4    5            Larry Ellison   $48.0 billion   70  United States   
 5    6             Charles Koch   $40.0 billion   78  United States   
 6    6               David Koch   $40.0 billion   73  United States   
 7    8          Sheldon Adelson   $38.0 billion   80  United States   
 8    9  Christy Walton & family   $36.7 billion   65  United States   
 9   10               Jim Walton   $34.7 billion   65  United States   
 
           Source(s) of wealth  
 0                   Microsoft  
 1  América Móvil, Grupo Carso  
 2                     Inditex  
 3          Berkshire Hathaway  
 4          Oracle Corporation  
 5             Koch Industries  
 6             Koch Industries  
 7             Las Vegas Sands  
 8                     Walmart  
 9                     Walmart  ,
    No.                          Name Net worth (USD)  Age    Nationality  \
 0    1          Carlos Slim & family   $73.0 billion   73         Mexico   
 1    2                    Bill Gates   $67.0 billion   57  United States   
 2    3                Amancio Ortega   $57.0 billion   76          Spain   
 3    4                Warren Buffett   $53.5 billion   82  United States   
 4    5                 Larry Ellison   $43.0 billion   68  United States   
 5    6                  Charles Koch   $34.0 billion   77  United States   
 6    6                    David Koch   $34.0 billion   72  United States   
 7    8                   Li Ka-shing   $31.0 billion   84      Hong Kong   
 8    9  Liliane Bettencourt & family   $30.0 billion   90         France   
 9   10               Bernard Arnault   $29.0 billion   63         France   
 
           Source(s) of wealth  
 0  América Móvil, Grupo Carso  
 1                   Microsoft  
 2               Inditex Group  
 3          Berkshire Hathaway  
 4          Oracle Corporation  
 5             Koch Industries  
 6             Koch Industries  
 7        Cheung Kong Holdings  
 8                     L'Oréal  
 9                        LVMH  ,
    No.                  Name Net worth (USD)  Age    Nationality  \
 0    1  Carlos Slim & family   $69.0 billion   72         Mexico   
 1    2            Bill Gates   $61.0 billion   56  United States   
 2    3        Warren Buffett   $44.0 billion   81  United States   
 3    4       Bernard Arnault   $41.0 billion   63         France   
 4    5        Amancio Ortega   $37.5 billion   75          Spain   
 5    6         Larry Ellison   $36.0 billion   67  United States   
 6    7          Eike Batista   $30.0 billion   55         Brazil   
 7    8        Stefan Persson   $26.0 billion   64         Sweden   
 8    9           Li Ka-shing   $25.5 billion   83      Hong Kong   
 9   10         Karl Albrecht   $25.4 billion   92        Germany   
 
                   Source(s) of wealth  
 0          América Móvil, Grupo Carso  
 1                           Microsoft  
 2                  Berkshire Hathaway  
 3  LVMH Moët Hennessy • Louis Vuitton  
 4                       Inditex Group  
 5                  Oracle Corporation  
 6                           EBX Group  
 7                                 H&M  
 8                Cheung Kong Holdings  
 9                                Aldi  ,
    No.                     Name Net worth (USD)  Age    Nationality  \
 0    1              Carlos Slim   $74.0 billion   71         Mexico   
 1    2               Bill Gates   $56.0 billion   55  United States   
 2    3           Warren Buffett   $50.0 billion   80  United States   
 3    4          Bernard Arnault   $41.0 billion   62         France   
 4    5            Larry Ellison   $39.5 billion   66  United States   
 5    6           Lakshmi Mittal   $31.1 billion   60          India   
 6    7           Amancio Ortega   $31.0 billion   74          Spain   
 7    8             Eike Batista   $30.0 billion   53         Brazil   
 8    9            Mukesh Ambani   $27.0 billion   54          India   
 9   10  Christy Walton & family   $26.5 billion   62  United States   
 
                   Source(s) of wealth  
 0          América Móvil, Grupo Carso  
 1                           Microsoft  
 2                  Berkshire Hathaway  
 3  LVMH Moët Hennessy • Louis Vuitton  
 4                  Oracle Corporation  
 5                      Arcelor Mittal  
 6                       Inditex Group  
 7                           EBX Group  
 8                 Reliance Industries  
 9                             Walmart  ,
    No.                  Name Net worth (USD)  Age    Nationality  \
 0    1  Carlos Slim & family   $53.5 billion   70         Mexico   
 1    2            Bill Gates   $53.0 billion   54  United States   
 2    3        Warren Buffett   $47.0 billion   79  United States   
 3    4         Mukesh Ambani   $29.0 billion   53          India   
 4    5        Lakshmi Mittal   $28.7 billion   60          India   
 5    6         Larry Ellison   $28.0 billion   66  United States   
 6    7       Bernard Arnault   $27.5 billion   61         France   
 7    8          Eike Batista   $27.0 billion   53         Brazil   
 8    9        Amancio Ortega   $25.0 billion   74          Spain   
 9   10         Karl Albrecht   $23.5 billion   90        Germany   
 
                   Source(s) of wealth  
 0          América Móvil, Grupo Carso  
 1                           Microsoft  
 2                  Berkshire Hathaway  
 3                 Reliance Industries  
 4                      Arcelor Mittal  
 5                  Oracle Corporation  
 6  LVMH Moët Hennessy • Louis Vuitton  
 7                           EBX Group  
 8                       Inditex Group  
 9                            Aldi Süd  ,
    No.            Name Net worth (USD)  Age    Nationality  \
 0    1      Bill Gates   $40.0 billion   53  United States   
 1    2  Warren Buffett   $37.0 billion   78  United States   
 2    3     Carlos Slim   $35.0 billion   69         Mexico   
 3    4   Larry Ellison   $22.5 billion   64  United States   
 4    5  Ingvar Kamprad   $22.0 billion   83         Sweden   
 5    6   Karl Albrecht   $21.5 billion   89        Germany   
 6    7   Mukesh Ambani   $19.5 billion   52          India   
 7    8  Lakshmi Mittal   $19.3 billion   58          India   
 8    9   Theo Albrecht   $18.8 billion   87        Germany   
 9   10  Amancio Ortega   $18.3 billion   73          Spain   
 
           Source(s) of wealth  
 0                   Microsoft  
 1          Berkshire Hathaway  
 2  América Móvil, Grupo Carso  
 3          Oracle Corporation  
 4                        IKEA  
 5                    Aldi Süd  
 6         Reliance Industries  
 7              Arcelor Mittal  
 8     Aldi Nord, Trader Joe's  
 9               Inditex Group  ,
    No.              Name Net worth (USD)  Age    Nationality  \
 0    1    Warren Buffett   $62.0 billion   77  United States   
 1    2       Carlos Slim   $60.0 billion   68         Mexico   
 2    3        Bill Gates   $58.0 billion   52  United States   
 3    4    Lakshmi Mittal   $45.0 billion   57          India   
 4    5     Mukesh Ambani   $43.0 billion   51          India   
 5    6       Anil Ambani   $42.0 billion   48          India   
 6    7    Ingvar Kamprad   $31.0 billion   81         Sweden   
 7    8  Kushal Pal Singh   $30.0 billion   76          India   
 8    9    Oleg Deripaska   $28.0 billion   40         Russia   
 9   10     Karl Albrecht   $27.0 billion   88        Germany   
 
            Source(s) of wealth  
 0           Berkshire Hathaway  
 1   América Móvil, Grupo Carso  
 2                    Microsoft  
 3               Arcelor Mittal  
 4          Reliance Industries  
 5  Anil Dhirubhai Ambani Group  
 6                         IKEA  
 7                    DLF Group  
 8                        Rusal  
 9                     Aldi Süd  ,
    No.             Name Net worth (USD)  Age    Nationality  \
 0    1       Bill Gates   $56.0 billion   51  United States   
 1    2   Warren Buffett   $52.0 billion   76  United States   
 2    3      Carlos Slim   $49.0 billion   67         Mexico   
 3    4   Ingvar Kamprad   $33.0 billion   80         Sweden   
 4    5   Lakshmi Mittal   $32.0 billion   56          India   
 5    6  Sheldon Adelson   $26.5 billion   73  United States   
 6    7  Bernard Arnault   $26.0 billion   58         France   
 7    8   Amancio Ortega   $24.0 billion   71          Spain   
 8    9      Li Ka-shing   $23.0 billion   78      Hong Kong   
 9   10    David Thomson   $22.0 billion   49         Canada   
 
                        Source(s) of wealth  
 0                                Microsoft  
 1                       Berkshire Hathaway  
 2               América Móvil, Grupo Carso  
 3                                     IKEA  
 4                           Arcelor Mittal  
 5                          Las Vegas Sands  
 6                                     LVMH  
 7                            Inditex Group  
 8  Cheung Kong Holdings, Hutchison Whampoa  
 9                      Thomson Corporation  ,
    No.                 Name Net worth (USD)  Age    Nationality  \
 0    1           Bill Gates   $52.0 billion   50  United States   
 1    2       Warren Buffett   $42.0 billion   75  United States   
 2    3          Carlos Slim   $30.0 billion   66         Mexico   
 3    4       Ingvar Kamprad   $28.0 billion   79         Sweden   
 4    5       Lakshmi Mittal   $23.5 billion   55          India   
 5    6           Paul Allen   $22.0 billion   53  United States   
 6    7      Bernard Arnault   $21.5 billion   57         France   
 7    8  Al-Waleed bin Talal   $20.0 billion   49   Saudi Arabia   
 8    9      Kenneth Thomson   $19.6 billion   82         Canada   
 9   10          Li Ka-shing   $18.8 billion   77      Hong Kong   
 
                     Source(s) of wealth  
 0                             Microsoft  
 1                    Berkshire Hathaway  
 2            América Móvil, Grupo Carso  
 3                                  IKEA  
 4                  Mittal Steel Company  
 5                             Microsoft  
 6    LVMH Moët Hennessy • Louis Vuitton  
 7               Kingdom Holding Company  
 8                   Thomson Corporation  
 9  Cheung Kong Group, Hutchison Whampoa  ,
    No.                 Name Net worth (USD)  Age    Nationality  \
 0    1           Bill Gates   $46.5 billion   49  United States   
 1    2       Warren Buffett   $44.0 billion   74  United States   
 2    3       Lakshmi Mittal   $25.0 billion   54          India   
 3    4          Carlos Slim   $23.8 billion   65         Mexico   
 4    5  Al-Waleed bin Talal   $23.7 billion   49   Saudi Arabia   
 5    6       Ingvar Kamprad   $23.0 billion   79         Sweden   
 6    7           Paul Allen   $21.0 billion   52  United States   
 7    8        Karl Albrecht   $18.5 billion   85        Germany   
 8    9        Larry Ellison   $18.4 billion   60  United States   
 9   10     S. Robson Walton   $18.3 billion   61  United States   
 
           Source(s) of wealth  
 0                   Microsoft  
 1          Berkshire Hathaway  
 2        Mittal Steel Company  
 3  América Móvil, Grupo Carso  
 4     Kingdom Holding Company  
 5                        IKEA  
 6                   Microsoft  
 7                    Aldi Süd  
 8          Oracle Corporation  
 9                     Walmart  ,
    No.                 Name Net worth (USD)  Age    Nationality  \
 0    1           Bill Gates   $46.6 billion   48  United States   
 1    2       Warren Buffett   $42.9 billion   73  United States   
 2    3        Karl Albrecht   $23.0 billion   84        Germany   
 3    4  Al-Waleed bin Talal   $21.5 billion   47   Saudi Arabia   
 4    5           Paul Allen   $21.0 billion   51  United States   
 5    6        Alice Walton*   $20.0 billion   55  United States   
 6    6        Helen Walton*   $20.0 billion   84  United States   
 7    6          Jim Walton*   $20.0 billion   56  United States   
 8    6         John Walton*   $20.0 billion   58  United States   
 9    6    S. Robson Walton*   $20.0 billion   60  United States   
 
        Source(s) of wealth  
 0                Microsoft  
 1       Berkshire Hathaway  
 2                 Aldi Süd  
 3  Kingdom Holding Company  
 4                Microsoft  
 5                 Wal-Mart  
 6                 Wal-Mart  
 7                 Wal-Mart  
 8                 Wal-Mart  
 9                 Wal-Mart  ,
     No.                    Name Net worth (USD)  Age    Nationality  \
 0     1              Bill Gates   $40.7 billion   47  United States   
 1     2          Warren Buffett   $30.5 billion   72  United States   
 2     3  Karl and Theo Albrecht   $25.6 billion   83        Germany   
 3     4              Paul Allen   $20.1 billion   50  United States   
 4     5     Al-Waleed bin Talal   $17.7 billion   46   Saudi Arabia   
 5     6           Larry Ellison   $16.6 billion   58  United States   
 6     7           Alice Walton*   $16.5 billion   54  United States   
 7     7           Helen Walton*   $16.5 billion   83  United States   
 8     7             Jim Walton*   $16.5 billion   55  United States   
 9     7            John Walton*   $16.5 billion   57  United States   
 10    7       S. Robson Walton*   $16.5 billion   59  United States   
 
         Source(s) of wealth  
 0                 Microsoft  
 1        Berkshire Hathaway  
 2                      Aldi  
 3                 Microsoft  
 4   Kingdom Holding Company  
 5        Oracle Corporation  
 6                  Wal-Mart  
 7                  Wal-Mart  
 8                  Wal-Mart  
 9                  Wal-Mart  
 10                 Wal-Mart  ,
    No.                    Name Net worth (USD)  Age    Nationality  \
 0    1              Bill Gates   $52.8 billion   46  United States   
 1    2          Warren Buffett   $35.0 billion   71  United States   
 2    3  Karl and Theo Albrecht   $26.8 billion   82        Germany   
 3    4              Paul Allen   $25.2 billion   49  United States   
 4    5           Larry Ellison   $23.5 billion   57  United States   
 5    6             Jim Walton*   $20.8 billion   54  United States   
 6    7            John Walton*   $20.7 billion   56  United States   
 7    8           Alice Walton*   $20.5 billion   53  United States   
 8    8       S. Robson Walton*   $20.5 billion   58  United States   
 9    8           Helen Walton*   $20.5 billion   82  United States   
 
   Source(s) of wealth  
 0           Microsoft  
 1  Berkshire Hathaway  
 2                Aldi  
 3           Microsoft  
 4  Oracle Corporation  
 5            Wal-Mart  
 6            Wal-Mart  
 7            Wal-Mart  
 8            Wal-Mart  
 9            Wal-Mart  ,
     No.                    Name Net worth (USD)  Age    Nationality  \
 0     1              Bill Gates   $58.7 billion   45  United States   
 1     2          Warren Buffett   $32.3 billion   70  United States   
 2     3              Paul Allen   $30.4 billion   48  United States   
 3     4           Larry Ellison   $26.0 billion   56  United States   
 4     5  Karl and Theo Albrecht   $25.0 billion   81        Germany   
 5     6     Al-Waleed bin Talal   $20.0 billion   44   Saudi Arabia   
 6     7             Jim Walton*   $18.8 billion   53  United States   
 7     8            John Walton*   $18.7 billion   55  United States   
 8     9       S. Robson Walton*   $18.6 billion   57  United States   
 9    10           Alice Walton*   $18.5 billion   52  United States   
 10   10           Helen Walton*   $18.5 billion   81  United States   
 
         Source(s) of wealth  
 0                 Microsoft  
 1        Berkshire Hathaway  
 2                 Microsoft  
 3        Oracle Corporation  
 4                      Aldi  
 5   Kingdom Holding Company  
 6                  Wal-Mart  
 7                  Wal-Mart  
 8                  Wal-Mart  
 9                  Wal-Mart  
 10                 Wal-Mart  ,
    No.                    Name Net worth (USD)  Age    Nationality  \
 0    1              Bill Gates   $60.0 billion   44  United States   
 1    2           Larry Ellison   $47.0 billion   55  United States   
 2    3              Paul Allen   $28.0 billion   47  United States   
 3    4          Warren Buffett   $25.6 billion   69  United States   
 4    5  Karl and Theo Albrecht   $20.0 billion   80        Germany   
 5    6     Al-Waleed bin Talal   $20.0 billion   43   Saudi Arabia   
 6    7        S. Robson Walton   $20.0 billion   57  United States   
 7    8           Masayoshi Son   $19.4 billion   43          Japan   
 8    9            Michael Dell   $19.1 billion   35  United States   
 9   10         Kenneth Thomson   $16.1 billion   77         Canada   
 
                  Source(s) of wealth  
 0                          Microsoft  
 1                 Oracle Corporation  
 2                          Microsoft  
 3                 Berkshire Hathaway  
 4                               Aldi  
 5            Kingdom Holding Company  
 6                           Wal-Mart  
 7  Softbank Capital, SoftBank Mobile  
 8                               Dell  
 9            The Thomson Corporation  ,
    No.[48]                    Name Net worth (USD)  Age    Nationality  \
 0        1              Bill Gates   $90.0 billion   43  United States   
 1        2          Warren Buffett   $36.0 billion   68  United States   
 2        3              Paul Allen   $30.0 billion   46  United States   
 3        4          Steven Ballmer   $19.5 billion   43  United States   
 4        5         Philip Anschutz   $16.5 billion   59  United States   
 5        6            Michael Dell   $16.5 billion   34  United States   
 6        7        S. Robson Walton   $15.8 billion   55  United States   
 7        8     Al-Waleed Bin Talal   $15.0 billion   42   Saudi Arabia   
 8        9  Karl and Theo Albrecht   $13.6 billion   79        Germany   
 9       10    Li Ka-shing & family   $12.6 billion   71      Hong Kong   
 
         Source(s) of wealth  
 0                 Microsoft  
 1        Berkshire Hathaway  
 2                 Microsoft  
 3                 Microsoft  
 4  The Anschutz Corporation  
 5                      Dell  
 6                  Wal-Mart  
 7   Kingdom Holding Company  
 8                      Aldi  
 9     CK Asset Holdings[49]  ,
    No.[48]                       Name Net worth (USD) Age    Nationality  \
 0        1                 Bill Gates   $51.0 billion  43  United States   
 1        2              Walton family   $48.0 billion   _  United States   
 2        3             Warren Buffett   $33.0 billion  67  United States   
 3        4                 Paul Allen   $21.0 billion  45  United States   
 4        5            Kenneth Thomson   $14.4 billion  74         Canada   
 5        6    Jay and Robert Pritzker   $13.5 billion   _  United States   
 6        7  Forrest Mars Sr. & family   $13.5 billion  94  United States   
 7        8        Al-Waleed Bin Talal   $13.3 billion  41   Saudi Arabia   
 8        9               Lee Shau-kee   $12.7 billion  70      Hong Kong   
 9       10     Karl and Theo Albrecht   $11.7 billion  78        Germany   
 
               Source(s) of wealth  
 0                       Microsoft  
 1                        Wal-Mart  
 2              Berkshire Hathaway  
 3                       Microsoft  
 4         Woodbridge Co. Ltd.[50]  
 5                       Hyatt[51]  
 6                  Mars, Inc.[52]  
 7         Kingdom Holding Company  
 8  Henderson Land Development[53]  
 9                            Aldi  ,
    No.[48]                       Name Net worth (USD) Age    Nationality  \
 0        1                 Bill Gates   $36.4 billion  42  United States   
 1        2              Walton family   $27.6 billion   _  United States   
 2        3             Warren Buffett   $23.2 billion  66  United States   
 3        4               Lee Shau-kee   $14.7 billion  69      Hong Kong   
 4        5                 Paul Allen   $14.1 billion  44  United States   
 5        6              Kwok brothers   $12.3 billion  48      Hong Kong   
 6        7                Haas family   $12.3 billion   _  United States   
 7        8  Forrest Mars Sr. & family   $12.0 billion  93  United States   
 8        9     Karl and Theo Albrecht   $11.5 billion  77        Germany   
 9       10      Tsai Wan-lin & family   $11.3 billion  73         Taiwan   
 
               Source(s) of wealth  
 0                       Microsoft  
 1                        Wal-Mart  
 2              Berkshire Hathaway  
 3  Henderson Land Development[53]  
 4                       Microsoft  
 5     Sun Hung Kai Properties[54]  
 6           Levi Strauss & Co[55]  
 7                  Mars, Inc.[52]  
 8                            Aldi  
 9       Cathay Life Insurance[56]  ,
    No.[48]                             Name Net worth (USD) Age  \
 0        1                    Walton family   $22.9 billion   _   
 1        2                       Bill Gates   $18.0 billion  41   
 2        3                   Warren Buffett   $15.3 billion  65   
 3        4  Oeri, Hoffman & Sacher families   $13.1 billion   _   
 4        5                     Lee Shau-kee   $12.7 billion  68   
 5        6            Tsai Wan-lin & family   $12.2 billion  72   
 6        7                    Kwok brothers   $11.2 billion   _   
 7        8             Li Ka-shing & family   $10.6 billion  68   
 8        9                Yoshiaki Tsutsumi    $9.2 billion  62   
 9       10           Karl and Theo Albrecht    $9.0 billion  76   
 
      Nationality             Source(s) of wealth  
 0  United States                        Wal-Mart  
 1  United States                       Microsoft  
 2  United States              Berkshire Hathaway  
 3    Switzerland                       Roche[57]  
 4      Hong Kong  Henderson Land Development[53]  
 5         Taiwan       Cathay Life Insurance[56]  
 6      Hong Kong     Sun Hung Kai Properties[54]  
 7      Hong Kong           CK Asset Holdings[49]  
 8          Japan               Seibu Railway[58]  
 9        Germany                        Aldi[59]  ,
    No.[48]                          Name Net worth (USD)    Nationality  \
 0        1                 Walton family   $23.5 billion  United States   
 1        2                    Bill Gates   $12.9 billion  United States   
 2        3                Warren Buffett   $10.7 billion  United States   
 3        4          Hans and Gad Rausing    $9.0 billion         Sweden   
 4        5             Yoshiaki Tsutsumi    $9.0 billion          Japan   
 5        6  Paul Sacher & Hoffman family    $8.6 billion    Switzerland   
 6        7         Tsai Wan-lin & family    $8.5 billion         Taiwan   
 7        8               Kenneth Thomson    $6.5 billion         Canada   
 8        9                  Lee Shau-kee    $6.5 billion      Hong Kong   
 9       10                 Chung Ju-yung    $6.2 billion    South Korea   
 
           Source(s) of wealth  
 0                    Wal-Mart  
 1                   Microsoft  
 2          Berkshire Hathaway  
 3                   Tetra Pak  
 4           Seibu Corporation  
 5           Hoffmann-La Roche  
 6              Lin Yuan Group  
 7         Thomson Corporation  
 8  Henderson Land Development  
 9                     Hyundai  ,
    No.[48]                          Name Net worth (USD)    Nationality  \
 0        1                 Walton family   $22.6 billion  United States   
 1        2                du Pont family    $9.0 billion  United States   
 2        3          Hans and Gad Rausing    $9.0 billion         Sweden   
 3        4             Yoshiaki Tsutsumi    $8.5 billion          Japan   
 4        5                    Bill Gates    $8.2 billion  United States   
 5        6                Warren Buffett    $7.9 billion  United States   
 6        7  Paul Sacher & Hoffman family    $7.8 billion    Switzerland   
 7        8         Tsai Wan-lin & family    $7.5 billion         Taiwan   
 8        9        Karl and Theo Albrecht    $7.3 billion        Germany   
 9       10                   Carlos Slim    $6.6 billion         Mexico   
 
           Source(s) of wealth  
 0                    Wal-Mart  
 1                      DuPont  
 2                   Tetra Pak  
 3           Seibu Corporation  
 4                   Microsoft  
 5          Berkshire Hathaway  
 6           Hoffmann-La Roche  
 7              Lin Yuan Group  
 8                        Aldi  
 9  América Móvil, Grupo Carso  ,
    No.[48]                         Name Net worth (USD)    Nationality  \
 0        1                Walton family   $25.3 billion  United States   
 1        2                  Mars family    $9.2 billion  United States   
 2        3            Yoshiaki Tsutsumi    $9.0 billion          Japan   
 3        4               du Pont family    $8.6 billion  United States   
 4        5        Minoru and Akira Mori    $7.5 billion          Japan   
 5        6                   Bill Gates    $7.4 billion  United States   
 6        7   Samuel and Donald Newhouse    $7.0 billion  United States   
 7        8  Sid and Lee Bass & brothers    $6.8 billion  United States   
 8        9               Warren Buffett    $6.6 billion  United States   
 9       10                  Erivan Haub    $6.2 billion        Germany   
 
      Source(s) of wealth  
 0               Wal-Mart  
 1             Mars, Inc.  
 2      Seibu Corporation  
 3                 DuPont  
 4  Mori Building Company  
 5              Microsoft  
 6   Advance Publications  
 7    Richardson Gasoline  
 8     Berkshire Hathaway  
 9       Tengelmann Group  ,
    No.[48]                      Name Net worth (USD)     Nationality  \
 0        1             Walton family   $23.8 billion   United States   
 1        2           Taikichiro Mori   $13.0 billion           Japan   
 2        3         Yoshiaki Tsutsumi   $10.0 billion           Japan   
 3        4      Hans and Gad Rausing    $7.0 billion          Sweden   
 4        5               Erivan Haub    $6.9 billion         Germany   
 5        6             Haniel family    $6.4 billion         Germany   
 6        7                Bill Gates    $6.4 billion   United States   
 7        8  David Sainsbury & family    $6.2 billion  United Kingdom   
 8        9           Kenneth Thomson    $6.2 billion          Canada   
 9       10              Shin Kyuk-ho    $6.0 billion     South Korea   
 
      Source(s) of wealth  
 0               Wal-Mart  
 1  Mori Building Company  
 2      Seibu Corporation  
 3              Tetra Pak  
 4       Tengelmann Group  
 5    Franz Haniel & Cie.  
 6              Microsoft  
 7            Sainsbury's  
 8    Thomson Corporation  
 9      Lotte Corporation  ,
    No.[48]                          Name Net worth (USD)    Nationality  \
 0        1                 Walton family   $18.5 billion  United States   
 1        2               Taikichiro Mori   $15.0 billion          Japan   
 2        3             Yoshiaki Tsutsumi   $14.0 billion          Japan   
 3        4                du Pont family   $10.0 billion  United States   
 4        5          Hans and Gad Rausing    $9.0 billion         Sweden   
 5        6          Kitaro Watanabe [ja]    $7.7 billion          Japan   
 6        7     Paul Reichmann & brothers    $7.1 billion         Canada   
 7        8               Kenneth Thomson    $6.8 billion         Canada   
 8        9  Kenkichi Nakajima [Wikidata]    $6.1 billion          Japan   
 9       10                  Shin Kyuk-ho    $6.0 billion    South Korea   
 
      Source(s) of wealth  
 0               Wal-Mart  
 1  Mori Building Company  
 2      Seibu Corporation  
 3                 DuPont  
 4              Tetra Pak  
 5         Azabu Building  
 6         Olympia & York  
 7    Thomson Corporation  
 8      Heiwa Corporation  
 9      Lotte Corporation  ,
    No.[48]                          Name Net worth (USD)    Nationality  \
 0        1             Yoshiaki Tsutsumi   $16.0 billion          Japan   
 1        2               Taikichiro Mori   $14.6 billion          Japan   
 2        3                 Walton family   $13.5 billion  United States   
 3        4                du Pont family   $10.0 billion  United States   
 4        5          Hans and Gad Rausing    $9.6 billion         Sweden   
 5        6          Kitaro Watanabe [ja]    $9.2 billion          Japan   
 6        7     Paul Reichmann & brothers    $9.0 billion         Canada   
 7        8  Kenkichi Nakajima [Wikidata]    $8.4 billion          Japan   
 8        9                  Shin Kyuk-ho    $7.5 billion    South Korea   
 9       10                Eitaro Itoyama    $5.8 billion          Japan   
 
      Source(s) of wealth  
 0      Seibu Corporation  
 1  Mori Building Company  
 2               Wal-Mart  
 3                 DuPont  
 4              Tetra Pak  
 5         Azabu Building  
 6         Olympia & York  
 7      Heiwa Corporation  
 8      Lotte Corporation  
 9       Shin Nihon Kanko  ,
    No.[60]                             Name Net worth (USD)    Nationality  \
 0        1                Yoshiaki Tsutsumi   $15.0 billion          Japan   
 1        2                  Taikichiro Mori   $14.2 billion          Japan   
 2        3              Sam Walton & family    $8.7 billion  United States   
 3        4               Reichmann brothers    $8.0 billion         Canada   
 4        4                     Shin Kyuk-ho    $8.0 billion    South Korea   
 5        6     Hirotomo Takei [ja] & family    $7.8 billion          Japan   
 6        7             Kitaro Watanabe [ja]   $7.0 billion+          Japan   
 7        8  Haruhiko Yoshimoto [ja]& family    $7.0 billion          Japan   
 8        8             Hans and Gad Rausing    $7.0 billion         Sweden   
 9       10                   Eitaro Itoyama    $6.6 billion          Japan   
 
      Source(s) of wealth  
 0      Seibu Corporation  
 1  Mori Building Company  
 2               Wal-Mart  
 3         Olympia & York  
 4      Lotte Corporation  
 5                 Chisan  
 6         Azabu Building  
 7            Real estate  
 8              Tetra Pak  
 9       Shin Nihon Kanko  ,
    No.[61]                     Name Net worth (USD)    Nationality  \
 0        1        Yoshiaki Tsutsumi   $18.9 billion          Japan   
 1        2          Taikichiro Mori   $18.0 billion          Japan   
 2        3       Reichmann brothers    $9.0 billion         Canada   
 3        4             Shin Kyuk-ho    $8.0 billion    South Korea   
 4        4             K. C. Irving    $8.0 billion         Canada   
 5        6  Haruhiko Yoshimoto [ja]    $7.8 billion          Japan   
 6        7               Sam Walton    $6.5 billion  United States   
 7        8             Tsai Wan-lin    $5.6 billion         Taiwan   
 8        9           Eitaro Itoyama   $5.0 billion+          Japan   
 9       10     Kitaro Watanabe [ja]    $5.2 billion          Japan   
 
      Source(s) of wealth  
 0      Seibu Corporation  
 1  Mori Building Company  
 2         Olympia & York  
 3      Lotte Corporation  
 4             Irving Oil  
 5            Real estate  
 6               Wal-Mart  
 7         Lin Yuan Group  
 8       Shin Nihon Kanko  
 9         Azabu Building  ,
    No.[62]                     Name Net worth (USD)   Nationality  \
 0        1        Yoshiaki Tsutsumi     $20 billion         Japan   
 1        2          Taikichiro Mori     $15 billion         Japan   
 2        3   Shigeru Kobayashi [ja]    $7.5 billion         Japan   
 3        4  Haruhiko Yoshimoto [ja]    $7.0 billion         Japan   
 4        5  Salim Ahmed bin Mahfouz    $6.2 billion  Saudi Arabia   
 5        6     Hans and Gad Rausing    $6.0 billion        Sweden   
 6        7           Paul Reichmann    $6.0 billion        Canada   
 7        8   Yohachiro Iwasaki [ja]    $5.6 billion         Japan   
 8        9          Kenneth Thomson    $5.4 billion        Canada   
 9       10               Keizo Saji    $4.0 billion         Japan   
 
         Source(s) of wealth  
 0         Seibu Corporation  
 1     Mori Building Company  
 2         Shuwa Corporation  
 3               Real estate  
 4  National Commercial Bank  
 5                 Tetra Pak  
 6            Olympia & York  
 7               Real estate  
 8       Thomson Corporation  
 9                   Suntory  ,
                                 Year            Number of billionaires  \
 0                            2024[2]                              2781   
 1                            2023[7]                              2640   
 2                            2022[6]                              2668   
 3                           2021[12]                              2755   
 4                               2020                              2095   
 5                               2019                              2153   
 6                               2018                              2208   
 7                               2017                              2043   
 8                               2016                              1810   
 9                           2015[19]                              1826   
 10                          2014[68]                              1645   
 11                          2013[69]                              1426   
 12                              2012                              1226   
 13                              2011                              1210   
 14                              2010                              1011   
 15                              2009                               793   
 16                              2008                              1125   
 17                              2007                               946   
 18                              2006                               793   
 19                              2005                               691   
 20                              2004                               587   
 21                              2003                               476   
 22                              2002                               497   
 23                              2001                               538   
 24                              2000                               470   
 25  Sources: Forbes.[19][68][67][69]  Sources: Forbes.[19][68][67][69]   
 
           Group's combined net worth  
 0                     $14.2 trillion  
 1                     $12.2 trillion  
 2                     $12.7 trillion  
 3                     $13.1 trillion  
 4                      $8.0 trillion  
 5                      $8.7 trillion  
 6                      $9.1 trillion  
 7                      $7.7 trillion  
 8                      $6.5 trillion  
 9                      $7.1 trillion  
 10                     $6.4 trillion  
 11                     $5.4 trillion  
 12                     $4.6 trillion  
 13                     $4.5 trillion  
 14                     $3.6 trillion  
 15                     $2.4 trillion  
 16                     $4.4 trillion  
 17                     $3.5 trillion  
 18                     $2.6 trillion  
 19                     $2.2 trillion  
 20                     $1.9 trillion  
 21                     $1.4 trillion  
 22                     $1.5 trillion  
 23                     $1.8 trillion  
 24                      $898 billion  
 25  Sources: Forbes.[19][68][67][69]  ,
    vteForbes magazine                               vteForbes magazine.1  \
 0           Companies                      Forbes Global 2000 Forbes 500   
 1              People  The World's Billionaires Forbes 400 30 Under 3...   
 2       Entertainment  General Forbes Top 40 Celebrity 100 Forbes Fic...   
 3             General  Forbes Top 40 Celebrity 100 Forbes Fictional 1...   
 4             Fashion                                Highest-paid models   
 5                Film                                Highest-paid actors   
 6               Music                             Highest-paid musicians   
 7               Sport  Highest-paid athletes Most valuable sports tea...   
 8           Education                             America's Top Colleges   
 9          Technology                Midas List (Tech's Top Deal Makers)   
 10     Related topics  Lists of people by net worth Wealthiest musica...   
 
     vteForbes magazine.2  
 0                    NaN  
 1                    NaN  
 2                    NaN  
 3                    NaN  
 4                    NaN  
 5                    NaN  
 6                    NaN  
 7                    NaN  
 8                    NaN  
 9                    NaN  
 10                   NaN  ,
          0                                                  1
 0  General  Forbes Top 40 Celebrity 100 Forbes Fictional 1...
 1  Fashion                                Highest-paid models
 2     Film                                Highest-paid actors
 3    Music                             Highest-paid musicians
 4    Sport  Highest-paid athletes Most valuable sports tea...,
   vteBillionaires                                  vteBillionaires.1
 0  By citizenship  Argentina Austria Belgium Brazil Canada Chile ...
 1       By region            World Africa ASEAN Europe Latin America
 2    Forbes lists  The World's Billionaires 2010 2011 2012 2013 2...
 3           Lists  Black Bloomberg Billionaires Index Financial R...
 4           Other                             Billionaire space race,
                                     vteExtreme wealth  \
 0                                            Concepts   
 1   Capital accumulation Overaccumulation Economic...   
 2                                              People   
 3                                              Wealth   
 4                                               Lists   
 5                                              People   
 6                                       Organizations   
 7                                               Other   
 8                                             Related   
 9   Diseases of affluence Affluenza Acquired situa...   
 10                                       Philanthropy   
 11                                            Sayings   
 12                                              Media   
 13                                Category by country   
 
                                   vteExtreme wealth.1  
 0   Capital accumulation Overaccumulation Economic...  
 1   Capital accumulation Overaccumulation Economic...  
 2   Billionaire Captain of industry High-net-worth...  
 3   Concentration Distribution Dynastic Effect Geo...  
 4   People Forbes list of billionaires List of cen...  
 5   Forbes list of billionaires List of centibilli...  
 6   Largest companies by revenue Largest corporate...  
 7   Cities by number of billionaires Countries by ...  
 8   Diseases of affluence Affluenza Acquired situa...  
 9   Diseases of affluence Affluenza Acquired situa...  
 10  Gospel of Wealth The Giving Pledge Philanthroc...  
 11  The rich get richer and the poor get poorer So...  
 12  Das Kapital Plutus Greek god of wealth Supercl...  
 13                                Category by country  ,
                                                    0  \
 0  Capital accumulation Overaccumulation Economic...   
 1                                             People   
 2                                             Wealth   
 
                                                    1  
 0  Capital accumulation Overaccumulation Economic...  
 1  Billionaire Captain of industry High-net-worth...  
 2  Concentration Distribution Dynastic Effect Geo...  ,
                0                                                  1
 0         People  Forbes list of billionaires List of centibilli...
 1  Organizations  Largest companies by revenue Largest corporate...
 2          Other  Cities by number of billionaires Countries by ...,
                                                    0  \
 0  Diseases of affluence Affluenza Acquired situa...   
 1                                       Philanthropy   
 2                                            Sayings   
 3                                              Media   
 
                                                    1  
 0  Diseases of affluence Affluenza Acquired situa...  
 1  Gospel of Wealth The Giving Pledge Philanthroc...  
 2  The rich get richer and the poor get poorer So...  
 3  Das Kapital Plutus Greek god of wealth Supercl...  ]
In [7]:
df_list[0]
Out[7]:
0 1
0 List of the world's billionaires, ranked in or... List of the world's billionaires, ranked in or...
1 The net worth of the world's billionaires incr... The net worth of the world's billionaires incr...
2 Publication details Publication details
3 Publisher Whale Media InvestmentsForbes family
4 Publication Forbes
5 First published March 1987[1]
6 Latest publication April 2, 2024
7 Current list details (2024)[2] Current list details (2024)[2]
8 Wealthiest Bernard Arnault
9 Net worth (1st) US$233 billion
10 Number of billionaires 2,781 (from 2640)
11 Total list net worth value US$14.2 trillion (from US$12.2 trillion)
12 Number of women 383
13 Number of men 2398
14 New members to the list 141
15 Forbes: The World's Billionaires website Forbes: The World's Billionaires website
In [8]:
df = df_list[2]
df.head()
Out[8]:
No. Name Net worth (USD) Age Nationality Primary source(s) of wealth
0 1 Bernard Arnault & family $233 billion 75 France LVMH
1 2 Elon Musk $195 billion 52 South Africa  Canada  United States Tesla, SpaceX
2 3 Jeff Bezos $194 billion 60 United States Amazon
3 4 Mark Zuckerberg $177 billion 39 United States Meta Platforms
4 5 Larry Ellison $141 billion 79 United States Oracle Corporation
In [9]:
# Select a specific table by matching text that it contains
pd.read_html(url, match='Number and combined net worth of billionaires by year')[0].head()
Out[9]:
Year Number of billionaires Group's combined net worth
0 2024[2] 2781 $14.2 trillion
1 2023[7] 2640 $12.2 trillion
2 2022[6] 2668 $12.7 trillion
3 2021[12] 2755 $13.1 trillion
4 2020 2095 $8.0 trillion

Case Study Example¶

  • Suppose a data scientist is asked to identify the best property investment.
  • The goal of the analysis is to find houses priced below the market rate.
  • Assume we have data on the asking prices of houses and other related variables.
  • To support the investment decision, we will perform EDA on the available data.

Example Case: House Property Price Data¶

  • Data source: http://byebuyhome.com/
  • Objective: find houses priced below the market rate.
  • Variables:
    • Dist_Taxi – distance to nearest taxi stand from the property
    • Dist_Market – distance to nearest grocery market from the property
    • Dist_Hospital – distance to nearest hospital from the property
    • Carpet – carpet area of the property in square feet
    • Builtup – built-up area of the property in square feet
    • Parking – type of car parking available with the property
    • City_Category – categorization of the city based on the size
    • Rainfall – annual rainfall in the area where property is located
    • House_Price – price at which the property was sold
In [10]:
# Importing Some Python Modules
import warnings; warnings.simplefilter('ignore')
import scipy, itertools, pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler


plt.style.use('bmh'); sns.set()
In [11]:
file_ = 'data/price.csv'
try: # Running locally: make sure "file_" is inside the "data" folder
    price = pd.read_csv(file_, low_memory = False, encoding='utf8')
except FileNotFoundError: # Running in Google Colab: download the file first
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/price.csv
    price = pd.read_csv(file_, low_memory = False, encoding='utf8')
    
N, P = price.shape # Size of the data
print('rows = ', N, ', columns (number of variables) = ', P)
print("Type of price = ", type(price))
# "Peek" at the first few rows
price.head(9)
rows =  936 , columns (number of variables) =  10
Type of price =  <class 'pandas.core.frame.DataFrame'>
Out[11]:
Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
0 1 9796.0 5250.0 10703.0 1659.0 1961.0 Open CAT B 530 6649000
1 2 8294.0 8186.0 12694.0 1461.0 1752.0 Not Provided CAT B 210 3982000
2 3 11001.0 14399.0 16991.0 1340.0 1609.0 Not Provided CAT A 720 5401000
3 4 8301.0 11188.0 12289.0 1451.0 1748.0 Covered CAT B 620 5373000
4 5 10510.0 12629.0 13921.0 1770.0 2111.0 Not Provided CAT B 450 4662000
5 6 6665.0 5142.0 9972.0 1442.0 1733.0 Open CAT B 760 4526000
6 7 13153.0 11869.0 17811.0 1542.0 1858.0 No Parking CAT A 1030 7224000
7 8 5882.0 9948.0 13315.0 1261.0 1507.0 Open CAT C 1020 3772000
8 9 7495.0 11589.0 13370.0 1090.0 1321.0 Not Provided CAT B 680 4631000
In [12]:
# "Peek" at the last few rows
price.tail(4)
Out[12]:
Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
932 933 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
933 934 9205.0 10418.0 14496.0 1118.0 1337.0 Open CAT A 560 7227000
934 935 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
935 936 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
In [13]:
# chosen at random
price.sample(10)
Out[13]:
Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
837 838 10180.0 11465.0 14967.0 1722.0 2078.0 Open CAT A 770 7361000
678 679 7288.0 9560.0 12531.0 1989.0 2414.0 No Parking CAT A 860 11632000
904 905 12834.0 11668.0 17029.0 1439.0 1732.0 Open CAT A 1170 8058000
304 305 4019.0 7091.0 8720.0 902.0 1093.0 Covered CAT A 1210 6464000
253 254 4906.0 10462.0 12246.0 1539.0 1848.0 Open CAT B 750 4714000
335 336 9464.0 10762.0 13998.0 1208.0 1459.0 Open CAT C 930 4149000
776 777 7374.0 11516.0 14480.0 1450.0 1728.0 Not Provided CAT C 930 3856000
861 862 3284.0 7836.0 9240.0 1671.0 2024.0 Not Provided CAT B 620 6310000
845 846 12189.0 13518.0 17420.0 1762.0 NaN Covered CAT A 790 8214000
16 17 11079.0 13102.0 13076.0 1578.0 1907.0 Open CAT A 1440 7725000

Note that the ".sample" command can also be used to sample training data¶

In [14]:
df_train = price.sample(300)
df_train.head()
Out[14]:
Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
349 350 10948.0 11622.0 14879.0 1624.0 1973.0 No Parking CAT C 1440 4466000
59 60 8458.0 13941.0 15721.0 1417.0 1701.0 Open CAT B 740 4867000
79 80 4589.0 12404.0 12558.0 1539.0 1833.0 Not Provided CAT A 650 8484000
922 923 9538.0 11551.0 12839.0 1655.0 1986.0 Covered CAT B 1150 7743000
411 412 7083.0 7275.0 10474.0 1264.0 1502.0 Open CAT A 800 7941000

Pay attention to the index labels (the first column) ... this is important for understanding the structure of a dataframe well¶

In [15]:
try:
    print(df_train.loc[798])
except Exception as err_:
    print(err_)
Observation               799
Dist_Taxi              9240.0
Dist_Market            9365.0
Dist_Hospital         13101.0
Carpet                 1596.0
Builtup                1939.0
Parking          Not Provided
City_Category           CAT A
Rainfall                  960
House_Price           7976000
Name: 798, dtype: object
In [16]:
df_train.iloc[0]#['Parking']
Out[16]:
Observation             350
Dist_Taxi           10948.0
Dist_Market         11622.0
Dist_Hospital       14879.0
Carpet               1624.0
Builtup              1973.0
Parking          No Parking
City_Category         CAT C
Rainfall               1440
House_Price         4466000
Name: 349, dtype: object
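The difference between label-based access (`.loc`) and position-based access (`.iloc`) is easiest to see on a tiny frame. The following is a minimal sketch, assuming a hypothetical three-row frame `toy` whose index labels are deliberately out of order, as they are after `.sample`:

```python
import pandas as pd

# Hypothetical toy frame; the index labels are intentionally not 0, 1, 2
toy = pd.DataFrame({"House_Price": [4466000, 4867000, 8484000]},
                   index=[349, 59, 79])

by_label = toy.loc[59, "House_Price"]   # the row whose index LABEL is 59
by_position = toy.iloc[0, 0]            # the first row, whatever its label is
print(by_label, by_position)

# A label that is absent raises KeyError, which is why a try/except is useful
try:
    toy.loc[798]
except KeyError as err_:
    print("KeyError:", err_)
```

In the notebook cell above, `df_train.loc[798]` only succeeds because row 798 happened to be drawn into the sample; with a different random draw it would raise a `KeyError`.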
In [17]:
# The index labels can therefore be used to select the rows that are NOT in df_train
df_test = price.loc[list(set(price.index) - set(df_train.index))]
print(df_test.shape)
df_test.head()
(636, 10)
Out[17]:
Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
0 1 9796.0 5250.0 10703.0 1659.0 1961.0 Open CAT B 530 6649000
2 3 11001.0 14399.0 16991.0 1340.0 1609.0 Not Provided CAT A 720 5401000
5 6 6665.0 5142.0 9972.0 1442.0 1733.0 Open CAT B 760 4526000
8 9 7495.0 11589.0 13370.0 1090.0 1321.0 Not Provided CAT B 680 4631000
10 11 4278.0 10646.0 8243.0 1187.0 1439.0 Covered CAT A 1090 7128000
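A repeatable variant of the split above can be sketched as follows. The toy frame `data`, the 30% fraction, and `random_state=42` are illustrative assumptions, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the `price` dataframe
data = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# random_state makes the same rows come out on every run
train = data.sample(frac=0.3, random_state=42)

# Dropping the sampled index labels leaves the complement, in original row order
test = data.drop(train.index)

print(train.shape, test.shape)
```

Using `drop(train.index)` keeps the remaining rows in their original order, whereas a difference of Python sets gives no ordering guarantee.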

Warning!!!¶

Although this can be done, it is "not recommended" ==> Why?¶

In [18]:
# We can also iterate over a dataframe (if needed)
for i, d in price.iterrows():
    print(i, d.House_Price)
    if i>2:
        break
d
0 6649000
1 3982000
2 5401000
3 5373000
Out[18]:
Observation            4
Dist_Taxi         8301.0
Dist_Market      11188.0
Dist_Hospital    12289.0
Carpet            1451.0
Builtup           1748.0
Parking          Covered
City_Category      CAT B
Rainfall             620
House_Price      5373000
Name: 3, dtype: object

Tips¶

Looping like this is generally discouraged because it tends to be slow; it remains useful, however, for variable transformations or preprocessing steps that are too complex to express as vectorized operations.¶
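To make the trade-off concrete, here is a minimal sketch that computes the same quantity twice, once with `iterrows` and once with a vectorized column expression. The toy frame and the carpet-to-built-up ratio are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data using two column names from the price dataset
toy = pd.DataFrame({"Carpet": [1659.0, 1461.0, 1340.0],
                    "Builtup": [1961.0, 1752.0, 1609.0]})

# Row-by-row loop: flexible, but slow on large frames
ratios_loop = []
for _, row in toy.iterrows():
    ratios_loop.append(row["Carpet"] / row["Builtup"])

# Vectorized: a single expression over whole columns
ratios_vec = toy["Carpet"] / toy["Builtup"]

print(ratios_vec.round(3).tolist())
```

Both approaches give the same numbers; the vectorized form avoids building a Series object for every row, which is where most of the looping overhead comes from.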

Removing a variable(s)¶

In [19]:
# note that this is written without "()" ==> it is a property (attribute), not a method
price.columns
Out[19]:
Index(['Observation', 'Dist_Taxi', 'Dist_Market', 'Dist_Hospital', 'Carpet',
       'Builtup', 'Parking', 'City_Category', 'Rainfall', 'House_Price'],
      dtype='object')
In [20]:
# Drop the first column because it carries no information (it is only a row counter)
price.drop("Observation", axis=1, inplace=True)
#price = price.drop("Observation", axis=1) # ==> equivalent reassignment form; the author strongly discourages it
price.head()
Out[20]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
0 9796.0 5250.0 10703.0 1659.0 1961.0 Open CAT B 530 6649000
1 8294.0 8186.0 12694.0 1461.0 1752.0 Not Provided CAT B 210 3982000
2 11001.0 14399.0 16991.0 1340.0 1609.0 Not Provided CAT A 720 5401000
3 8301.0 11188.0 12289.0 1451.0 1748.0 Covered CAT B 620 5373000
4 10510.0 12629.0 13921.0 1770.0 2111.0 Not Provided CAT B 450 4662000

Correcting Variable Types¶

In [21]:
# data type of each column
# Always check whether the data types are appropriate
# Note that a df, like every variable in Python, is an object
price.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 936 entries, 0 to 935
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Dist_Taxi      923 non-null    float64
 1   Dist_Market    923 non-null    float64
 2   Dist_Hospital  935 non-null    float64
 3   Carpet         928 non-null    float64
 4   Builtup        921 non-null    float64
 5   Parking        936 non-null    object 
 6   City_Category  936 non-null    object 
 7   Rainfall       936 non-null    int64  
 8   House_Price    936 non-null    int64  
dtypes: float64(5), int64(2), object(2)
memory usage: 65.9+ KB
In [22]:
price.dtypes
Out[22]:
Dist_Taxi        float64
Dist_Market      float64
Dist_Hospital    float64
Carpet           float64
Builtup          float64
Parking           object
City_Category     object
Rainfall           int64
House_Price        int64
dtype: object
In [23]:
# dataframe types: https://pbpython.com/pandas_dtypes.html
price['Parking'] = price['Parking'].astype('category')
price['City_Category'] = price['City_Category'].astype('category')
price.dtypes
Out[23]:
Dist_Taxi         float64
Dist_Market       float64
Dist_Hospital     float64
Carpet            float64
Builtup           float64
Parking          category
City_Category    category
Rainfall            int64
House_Price         int64
dtype: object
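What the conversion to `category` changes internally can be sketched on a hypothetical column with many repeated labels. The Series below is made up for illustration; only the four label names come from the dataset.

```python
import pandas as pd

# Hypothetical column with many repeated labels
s_obj = pd.Series(["Open", "Covered", "No Parking", "Not Provided"] * 1000)
s_cat = s_obj.astype("category")

# A categorical stores each label once, plus a small integer code per row
print(s_obj.memory_usage(deep=True), s_cat.memory_usage(deep=True))
print(s_cat.cat.categories.tolist())     # the distinct labels, sorted
print(s_cat.cat.codes.head(4).tolist())  # integer codes of the first four rows
```

Besides saving memory, the `category` dtype lets `select_dtypes(include=['category'])` pick out these columns later on.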

image source: http://writer.lk/portfolio-item/statistics/¶

Central Tendency is not enough¶

Source: https://www.youtube.com/PeristiwaAlamAneh

Data Variability¶

Descriptive Statistics¶

In [24]:
price.describe()
Out[24]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
count 923.000000 923.000000 935.000000 928.000000 921.000000 936.000000 9.360000e+02
mean 8239.512459 11039.122427 13082.894118 1511.558190 1794.610206 786.730769 6.089048e+06
std 2561.188953 2565.058074 2586.507654 789.370074 467.395372 266.218109 5.015046e+06
min 146.000000 1666.000000 3227.000000 775.000000 932.000000 -110.000000 3.000000e+04
25% 6481.500000 9366.000000 11308.000000 1318.000000 1583.000000 600.000000 4.661000e+06
50% 8233.000000 11166.000000 13179.000000 1481.000000 1775.000000 780.000000 5.879500e+06
75% 9967.000000 12688.500000 14848.000000 1653.500000 1982.000000 970.000000 7.187250e+06
max 20662.000000 20945.000000 23294.000000 24300.000000 12730.000000 1560.000000 1.500000e+08
In [25]:
# Simple statistics for all columns (numeric and categorical), transposed so that each variable is a row
price.describe(include='all').transpose()
Out[25]:
count unique top freq mean std min 25% 50% 75% max
Dist_Taxi 923.0 NaN NaN NaN 8239.512459 2561.188953 146.0 6481.5 8233.0 9967.0 20662.0
Dist_Market 923.0 NaN NaN NaN 11039.122427 2565.058074 1666.0 9366.0 11166.0 12688.5 20945.0
Dist_Hospital 935.0 NaN NaN NaN 13082.894118 2586.507654 3227.0 11308.0 13179.0 14848.0 23294.0
Carpet 928.0 NaN NaN NaN 1511.55819 789.370074 775.0 1318.0 1481.0 1653.5 24300.0
Builtup 921.0 NaN NaN NaN 1794.610206 467.395372 932.0 1583.0 1775.0 1982.0 12730.0
Parking 936 4 Open 373 NaN NaN NaN NaN NaN NaN NaN
City_Category 936 3 CAT B 365 NaN NaN NaN NaN NaN NaN NaN
Rainfall 936.0 NaN NaN NaN 786.730769 266.218109 -110.0 600.0 780.0 970.0 1560.0
House_Price 936.0 NaN NaN NaN 6089048.076923 5015045.744038 30000.0 4661000.0 5879500.0 7187250.0 150000000.0

Caution¶

  • A mode does not always exist
  • When to use the mean and when the median (with respect to outliers)
  • Min/max can be used to detect noise/outliers
  • What is the difference between noise and an outlier?
  • Why must outliers/noise be handled during preprocessing?
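The first two points can be illustrated with a small made-up sample. This is only a sketch; the six prices below are hypothetical and merely mimic the situation in the `describe()` output, where one very large maximum sits far from the quartiles.

```python
import pandas as pd

# Hypothetical prices: five ordinary values plus one extreme entry
prices = pd.Series([4_500_000, 5_000_000, 5_400_000,
                    6_000_000, 6_600_000, 150_000_000])

print("mean  :", prices.mean())           # pulled far upward by the single extreme value
print("median:", prices.median())         # stays close to the bulk of the data
print("mode  :", prices.mode().tolist())  # every value occurs once, so all are returned
```

When every value is unique, `mode()` returns all of them, which is the practical meaning of "a mode does not always exist".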
In [26]:
# include='all' is an extra parameter for when we also want simple statistics for all of the data
# (including categorical data)
price[['Dist_Taxi','Parking']].describe(include='all')
Out[26]:
Dist_Taxi Parking
count 923.000000 936
unique NaN 4
top NaN Open
freq NaN 373
mean 8239.512459 NaN
std 2561.188953 NaN
min 146.000000 NaN
25% 6481.500000 NaN
50% 8233.000000 NaN
75% 9967.000000 NaN
max 20662.000000 NaN

Distribution of values in each categorical variable¶

In [27]:
price['Parking'].unique()
Out[27]:
['Open', 'Not Provided', 'Covered', 'No Parking']
Categories (4, object): ['Covered', 'No Parking', 'Not Provided', 'Open']
In [28]:
a = price['Parking']
In [29]:
dir(a)
Out[29]:
['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__column_consortium_standard__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__weakref__',
 '__xor__',
 '_accessors',
 '_accum_func',
 '_agg_examples_doc',
 '_agg_see_also_doc',
 '_align_for_op',
 '_align_frame',
 '_align_series',
 '_append',
 '_arith_method',
 '_as_manager',
 '_attrs',
 '_binop',
 '_cacher',
 '_can_hold_na',
 '_check_inplace_and_allows_duplicate_labels',
 '_check_is_chained_assignment_possible',
 '_check_label_or_level_ambiguity',
 '_check_setitem_copy',
 '_clear_item_cache',
 '_clip_with_one_bound',
 '_clip_with_scalar',
 '_cmp_method',
 '_consolidate',
 '_consolidate_inplace',
 '_construct_axes_dict',
 '_construct_result',
 '_constructor',
 '_constructor_expanddim',
 '_constructor_expanddim_from_mgr',
 '_constructor_from_mgr',
 '_data',
 '_deprecate_downcast',
 '_dir_additions',
 '_dir_deletions',
 '_drop_axis',
 '_drop_labels_or_levels',
 '_duplicated',
 '_find_valid_index',
 '_flags',
 '_flex_method',
 '_from_mgr',
 '_get_axis',
 '_get_axis_name',
 '_get_axis_number',
 '_get_axis_resolvers',
 '_get_block_manager_axis',
 '_get_bool_data',
 '_get_cacher',
 '_get_cleaned_column_resolvers',
 '_get_index_resolvers',
 '_get_label_or_level_values',
 '_get_numeric_data',
 '_get_rows_with_mask',
 '_get_value',
 '_get_values_tuple',
 '_get_with',
 '_getitem_slice',
 '_gotitem',
 '_hidden_attrs',
 '_indexed_same',
 '_info_axis',
 '_info_axis_name',
 '_info_axis_number',
 '_init_dict',
 '_init_mgr',
 '_inplace_method',
 '_internal_names',
 '_internal_names_set',
 '_is_cached',
 '_is_copy',
 '_is_label_or_level_reference',
 '_is_label_reference',
 '_is_level_reference',
 '_is_mixed_type',
 '_is_view',
 '_is_view_after_cow_rules',
 '_item_cache',
 '_ixs',
 '_logical_func',
 '_logical_method',
 '_map_values',
 '_maybe_update_cacher',
 '_memory_usage',
 '_metadata',
 '_mgr',
 '_min_count_stat_function',
 '_name',
 '_needs_reindex_multi',
 '_pad_or_backfill',
 '_protect_consolidate',
 '_reduce',
 '_references',
 '_reindex_axes',
 '_reindex_indexer',
 '_reindex_multi',
 '_reindex_with_indexers',
 '_rename',
 '_replace_single',
 '_repr_data_resource_',
 '_repr_latex_',
 '_reset_cache',
 '_reset_cacher',
 '_set_as_cached',
 '_set_axis',
 '_set_axis_name',
 '_set_axis_nocheck',
 '_set_is_copy',
 '_set_labels',
 '_set_name',
 '_set_value',
 '_set_values',
 '_set_with',
 '_set_with_engine',
 '_shift_with_freq',
 '_slice',
 '_stat_function',
 '_stat_function_ddof',
 '_take_with_is_copy',
 '_to_latex_via_styler',
 '_typ',
 '_update_inplace',
 '_validate_dtype',
 '_values',
 '_where',
 'abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'array',
 'asfreq',
 'asof',
 'astype',
 'at',
 'at_time',
 'attrs',
 'autocorr',
 'axes',
 'backfill',
 'between',
 'between_time',
 'bfill',
 'bool',
 'case_when',
 'cat',
 'clip',
 'combine',
 'combine_first',
 'compare',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'dtype',
 'dtypes',
 'duplicated',
 'empty',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'flags',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'hasnans',
 'head',
 'hist',
 'iat',
 'idxmax',
 'idxmin',
 'iloc',
 'index',
 'infer_objects',
 'info',
 'interpolate',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'list',
 'loc',
 'lt',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'name',
 'nbytes',
 'ndim',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pad',
 'pct_change',
 'pipe',
 'plot',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'set_flags',
 'shape',
 'shift',
 'size',
 'skew',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'str',
 'struct',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'values',
 'var',
 'view',
 'where',
 'xs']
In [30]:
a.value_counts()
Out[30]:
Parking
Open            373
Not Provided    230
Covered         188
No Parking      145
Name: count, dtype: int64
In [31]:
set(price['Parking'])
Out[31]:
{'Covered', 'No Parking', 'Not Provided', 'Open'}
In [32]:
# Frequency of each category
price['Parking'].value_counts()
# this information can also be visualized
Out[32]:
Parking
Open            373
Not Provided    230
Covered         188
No Parking      145
Name: count, dtype: int64

We can also use Counter from the collections module¶

In [33]:
from collections import Counter

Counter(price['Parking'])
Out[33]:
Counter({'Open': 373, 'Not Provided': 230, 'Covered': 188, 'No Parking': 145})
In [34]:
a = [1, 2, 3, 4, 3, 7]
Counter(a)
Out[34]:
Counter({3: 2, 1: 1, 2: 1, 4: 1, 7: 1})

Two-Way Tables (contingency tables)¶

In [35]:
CT = pd.crosstab(index=price["City_Category"], columns=price["Parking"])
CT
Out[35]:
Parking Covered No Parking Not Provided Open
City_Category
CAT A 75 51 82 122
CAT B 64 53 89 159
CAT C 49 41 59 92
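Raw counts are hard to compare when the groups differ in size. The sketch below, on a hypothetical six-row frame, shows two `pd.crosstab` options that often help: `margins=True` for totals and `normalize="index"` for row proportions.

```python
import pandas as pd

# Hypothetical toy data with the same two column names
toy = pd.DataFrame({
    "City_Category": ["CAT A", "CAT A", "CAT B", "CAT B", "CAT B", "CAT C"],
    "Parking":       ["Open", "Covered", "Open", "Open", "Covered", "Open"],
})

# margins=True appends row/column totals under the label "All"
counts = pd.crosstab(toy["City_Category"], toy["Parking"], margins=True)

# normalize="index" turns each row into proportions that sum to 1
row_share = pd.crosstab(toy["City_Category"], toy["Parking"], normalize="index")

print(counts)
print(row_share.round(2))
```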

Data Grouping-Slicing¶

In [36]:
# Slicing DataFrame - Just like query in SQL
price[price["City_Category"] == "CAT B"].describe()
# After this filter "City_Category" holds a single value; it can be removed with .drop("City_Category", axis=1)
Out[36]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
count 358.000000 358.000000 365.000000 362.000000 358.000000 365.000000 3.650000e+02
mean 8101.061453 10713.675978 12880.435616 1565.709945 1831.016760 782.958904 5.919148e+06
std 2559.846491 2569.681709 2611.683801 1224.410669 649.957568 259.713517 7.675921e+06
min 604.000000 4950.000000 4922.000000 869.000000 1050.000000 0.000000 2.130000e+06
25% 6391.250000 8916.000000 11170.000000 1327.250000 1584.750000 590.000000 4.622000e+06
50% 8022.000000 10719.500000 12936.000000 1490.000000 1788.000000 770.000000 5.459000e+06
75% 9786.500000 12524.000000 14663.000000 1688.000000 2022.750000 960.000000 6.395000e+06
max 20662.000000 20945.000000 23294.000000 24300.000000 12730.000000 1560.000000 1.500000e+08
In [37]:
# Another way
# Slicing DataFrame - Just like query in SQL
price[price["Parking"].isin(["Open","Covered"])].describe()
# .drop("Parking", axis=1) could be added, although here "Parking" still holds two values, not one
Out[37]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
count 553.000000 553.000000 560.000000 555.000000 547.000000 561.000000 5.610000e+02
mean 8059.430380 10929.074141 12902.832143 1533.926126 1809.641682 800.338681 6.311734e+06
std 2617.056273 2546.474961 2512.450050 999.998159 554.337885 265.722854 6.323591e+06
min 146.000000 1666.000000 3227.000000 775.000000 932.000000 70.000000 3.000000e+04
25% 6209.000000 9154.000000 11263.750000 1321.500000 1592.500000 610.000000 4.773000e+06
50% 8081.000000 11008.000000 13056.500000 1490.000000 1787.000000 790.000000 6.024000e+06
75% 9858.000000 12616.000000 14576.750000 1659.000000 1983.500000 980.000000 7.399000e+06
max 20662.000000 20945.000000 23294.000000 24300.000000 12730.000000 1560.000000 1.500000e+08
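Filtering gives the summary for one group at a time. When the same summary is wanted for every group, `groupby` does it in one step. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data with two column names from the price dataset
toy = pd.DataFrame({
    "City_Category": ["CAT A", "CAT A", "CAT B", "CAT B", "CAT C"],
    "House_Price":   [7_000_000, 8_000_000, 5_000_000, 6_000_000, 4_000_000],
})

# One summary row per category instead of one filter per category
summary = toy.groupby("City_Category")["House_Price"].agg(["count", "mean", "median"])
print(summary)
```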

Removing Duplicate Data¶

  • Frequently found in Big Data systems.
  • Affects models and analyses that are based on frequency.
  • Sometimes we create duplicates deliberately (e.g. in imbalanced learning).

image source: http://www.dagdoo.org/excel-learning/power-query/

In [38]:
# check whether there are duplicate rows
print(price.shape)
price.duplicated().sum()
(936, 9)
Out[38]:
4
In [39]:
price[price.duplicated()]

# Note: had we not dropped the "Observation" variable earlier, no duplicates would be found this way (every row would carry a unique ID).
Out[39]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
932 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
933 9205.0 10418.0 14496.0 1118.0 1337.0 Open CAT A 560 7227000
934 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
935 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
In [40]:
# We can also look for duplicates based on specific columns only

price[price.duplicated(subset=['House_Price'])]
Out[40]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
187 4917.0 7195.0 9468.0 1704.0 2032.0 Covered CAT C 590 4830000
199 8704.0 13572.0 12349.0 1666.0 2000.0 Open CAT C 480 3973000
213 10187.0 12921.0 13539.0 1321.0 1579.0 Covered CAT B 770 6889000
240 6571.0 10429.0 11465.0 1350.0 1634.0 Open CAT B 880 7712000
244 10612.0 8229.0 15696.0 1366.0 1649.0 Not Provided CAT B 940 5278000
... ... ... ... ... ... ... ... ... ...
927 12176.0 8518.0 15673.0 1582.0 1910.0 Covered CAT C 1080 6639000
932 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
933 9205.0 10418.0 14496.0 1118.0 1337.0 Open CAT A 560 7227000
934 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000
935 10915.0 17486.0 15964.0 1549.0 1851.0 Not Provided CAT C 1220 7062000

87 rows × 9 columns

In [41]:
# remove duplicated rows (the first occurrence of each is kept)
price.drop_duplicates(inplace=True)
print(price.duplicated().sum()) # no more duplicates
print(price.shape) # re-check by printing data size
0
(932, 9)
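Which copy counts as "the duplicate" depends on the `keep` parameter. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical frame in which rows 0, 2 and 3 are identical
toy = pd.DataFrame({"Carpet": [1549.0, 1118.0, 1549.0, 1549.0],
                    "House_Price": [7062000, 7227000, 7062000, 7062000]})

print(toy.duplicated().tolist())            # default keep="first": only later copies are flagged
print(toy.duplicated(keep=False).tolist())  # keep=False: every member of a duplicated group is flagged

# drop_duplicates keeps the original index labels; reset_index renumbers them
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Using `keep=False` is handy for inspecting all copies side by side before deciding which ones to remove.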

Variable Selection¶

Slicing data by type is very important, because certain models only work with certain data types¶

In [42]:
# price
# If only the column names are needed, we can do this to save memory
numVar = price.select_dtypes(include = ['float64', 'int64']).columns
list(numVar)
Out[42]:
['Dist_Taxi',
 'Dist_Market',
 'Dist_Hospital',
 'Carpet',
 'Builtup',
 'Rainfall',
 'House_Price']
In [43]:
# Select only the variables of certain types
price_num = price.select_dtypes(include = ['float64', 'int64'])
price_num.head()
# Note that price_num is a new dataframe variable! ... (be careful with large data)
Out[43]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
0 9796.0 5250.0 10703.0 1659.0 1961.0 530 6649000
1 8294.0 8186.0 12694.0 1461.0 1752.0 210 3982000
2 11001.0 14399.0 16991.0 1340.0 1609.0 720 5401000
3 8301.0 11188.0 12289.0 1451.0 1748.0 620 5373000
4 10510.0 12629.0 13921.0 1770.0 2111.0 450 4662000

Distribution of values in each categorical variable¶

In [44]:
# Select only the variables of certain types
price_cat = price.select_dtypes(include = ['category'])
price_cat.head()
Out[44]:
Parking City_Category
0 Open CAT B
1 Not Provided CAT B
2 Not Provided CAT A
3 Covered CAT B
4 Not Provided CAT B
In [45]:
# get all unique values of a variable/column
for col in price_cat.columns:
    print(col,': ', set(price[col].unique()))
Parking :  {'Covered', 'No Parking', 'Open', 'Not Provided'}
City_Category :  {'CAT C', 'CAT A', 'CAT B'}

We will visualize this later¶

Basics of Handling Categorical Variables: Dummy Variables¶

In [46]:
df = pd.get_dummies(price['Parking'], prefix='Park')
df.head()
Out[46]:
Park_Covered Park_No Parking Park_Not Provided Park_Open
0 False False False True
1 False False True False
2 False False True False
3 True False False False
4 False False True False

Combining with the original data (concat)¶

In [47]:
df2 = pd.concat([price, df], axis = 1)
df2.head().transpose() 
# use transpose when the data has many columns
Out[47]:
0 1 2 3 4
Dist_Taxi 9796.0 8294.0 11001.0 8301.0 10510.0
Dist_Market 5250.0 8186.0 14399.0 11188.0 12629.0
Dist_Hospital 10703.0 12694.0 16991.0 12289.0 13921.0
Carpet 1659.0 1461.0 1340.0 1451.0 1770.0
Builtup 1961.0 1752.0 1609.0 1748.0 2111.0
Parking Open Not Provided Not Provided Covered Not Provided
City_Category CAT B CAT B CAT A CAT B CAT B
Rainfall 530 210 720 620 450
House_Price 6649000 3982000 5401000 5373000 4662000
Park_Covered False False False True False
Park_No Parking False False False False False
Park_Not Provided False True True False True
Park_Open True False False False False
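The two steps above (`get_dummies` followed by `pd.concat`) can also be done in a single call. The sketch below uses a hypothetical four-row frame; the choices `drop_first=True` and `dtype=int` are illustrative assumptions rather than settings used in the notebook.

```python
import pandas as pd

# Hypothetical toy data with two column names from the price dataset
toy = pd.DataFrame({"Parking": ["Open", "Covered", "No Parking", "Open"],
                    "House_Price": [6649000, 5373000, 7224000, 4526000]})

# columns=[...] encodes the column in place of the original, so no concat is needed;
# drop_first=True omits one level (the reference category) to avoid a redundant column;
# dtype=int yields 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["Parking"], prefix="Park",
                         drop_first=True, dtype=int)
print(encoded)
```

Dropping one level matters mainly for linear models with an intercept, where a full set of dummy columns is perfectly collinear.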

Selecting Data Manually¶

In [48]:
# Choosing some columns manually
X = price[['House_Price','Dist_Market']] 
X[:7]
Out[48]:
House_Price Dist_Market
0 6649000 5250.0
1 3982000 8186.0
2 5401000 14399.0
3 5373000 11188.0
4 4662000 12629.0
5 4526000 5142.0
6 7224000 11869.0

Noisy Data¶

  • Noise can occur because of:
    • Measurement instrument errors: e.g. in IoT devices during bad weather or with a weak battery.
    • Input/entry errors
    • Imperfect transmission
    • Inconsistent naming

Outliers¶

  • Data whose characteristics differ significantly from most of the other data according to some specified criterion.
    • The data is valid (not noise)
    • Very common in Big Data.
  • What should be done with outliers?

Univariate Outliers¶

  • Quartiles (Boxplot)
  • Normality assumption
  • Other distributional assumptions
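The quartile (boxplot) rule from the list above can be sketched as follows. The helper name `iqr_bounds`, the conventional multiplier `k=1.5`, and the sample values are assumptions made for illustration.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return the (lower, upper) fences of the boxplot rule: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one value far above the rest
s = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])
low, high = iqr_bounds(s)
outliers = s[(s < low) | (s > high)]
print(low, high, outliers.tolist())
```

Because quartiles barely move when a single extreme value is added, this rule does not rely on a normality assumption.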

Multivariate Outliers¶

  • Clustering (DBSCAN)
  • Isolation Forest
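As a minimal sketch of the second method, the example below runs scikit-learn's `IsolationForest` on synthetic two-dimensional data. The data, the `contamination=0.01` setting, and the random seeds are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical 2-D data: 200 ordinary points plus one point far from the cloud
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[15.0, 15.0]]])

# contamination is the assumed share of outliers; it sets the decision threshold
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)   # +1 = inlier, -1 = outlier

print("flagged rows:", np.where(labels == -1)[0])
```

Because `contamination` fixes the share of points that get flagged, a few ordinary points near the edge of the cloud may be labeled as outliers along with the extreme one.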

Comparison of several (multivariate) outlier detection methods:

  1. http://scikit-learn.org/stable/auto_examples/applications/plot_outlier_detection_housing.html#sphx-glr-auto-examples-applications-plot-outlier-detection-housing-py
  2. http://scikit-learn.org/stable/auto_examples/covariance/plot_outlier_detection.html#sphx-glr-auto-examples-covariance-plot-outlier-detection-py
  3. http://scikit-learn.org/stable/auto_examples/neighbors/plot_lof.html#sphx-glr-auto-examples-neighbors-plot-lof-py
  4. http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
  5. https://blog.dominodatalab.com/topology-and-density-based-clustering/

Does the type of parking tend to make a difference in house prices?¶

In [49]:
p= sns.catplot(x="Parking", y="House_Price", data=price)
# What can be seen from this result?
[Figure: categorical scatter plot of House_Price for each Parking category]

Outlier or noise? How to decide?¶

Univariate Outlier removal¶

Requires an assumption about the "distribution" of the data¶

Normality Assumption¶

In [50]:
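# Distributions (sns.distplot is deprecated; histplot + rugplot draws the same kind of figure)
p = sns.histplot(price['House_Price'], kde=True, stat='density')
sns.rugplot(price['House_Price'], ax=p)
[Figure: histogram of House_Price with KDE curve and rug marks]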
In [51]:
# Suppose the data are assumed to be normally distributed:
# about 95% of the values then lie within 2 standard deviations of the mean of the price variable

df = np.abs(price.House_Price - price.House_Price.mean())<=(2*price.House_Price.std())
# boolean mask: True where mu - 2s <= x <= mu + 2s
print(df.shape)
df.head()
(932,)
Out[51]:
0    True
1    True
2    True
3    True
4    True
Name: House_Price, dtype: bool
In [52]:
price2 = price[df].copy() # Data without outliers
print(price2.shape, price.shape)
# Note that the data with outliers removed is deliberately
# stored in a new variable, "price2"
# Be careful doing this with large data, since it duplicates the data in memory
(931, 9) (932, 9)
In [53]:
# Distribution of house prices after removing outliers
p = sns.histplot(price2['House_Price'], kde=True, stat='density')
p = sns.rugplot(price2['House_Price'], ax=p)
In [54]:
p= sns.catplot(x="Parking", y="House_Price", data=price2)
# What can we see from this result?

Missing Values¶

One step in "cleaning the data" is identifying and handling missing values. What is a missing value? It is the term for data that is absent, i.e. a value that should have been recorded but was not.

Causes of Missing Values¶

Missing data can arise for several reasons, for example:

  • Data entry errors, whether human error or system faults
  • In survey data, respondents who forget to answer a question, questions that are hard to understand, or questions that respondents are reluctant to answer because they are sensitive

How Do We Detect Missing Values?¶

Usually a missing value is indicated by leaving the cell empty.

In real-world data, however, the markers used to indicate a missing value vary widely: it may be written as '?' (question mark), as '-' (dash), or as a very large or very small number (e.g. 99 or -999).

As an illustration, consider the following data:

Note that this data uses several different ways to indicate that a cell is missing, for example:

  • the cell is left empty
  • it is written as n/a, NA, na, or NaN
  • it is written with the symbol –
  • or it holds a rather odd value, such as 12 in the OWN_OCCUPIED column or HURLEY in the NUM_BATH column

When we load this data into Python with pandas, several common missing-value notations (for example empty cells, NA, n/a, and NaN) are automatically converted to NaN, the marker pandas uses for missing values. Other markers, such as na or –, are read as ordinary values and have to be handled explicitly.
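
The loading behaviour described above can be sketched as follows. The CSV text, its column names, and the list of extra markers are assumptions made up for this illustration; `na_values` is the `pandas.read_csv` parameter for declaring additional missing-value markers.

```python
import io
import pandas as pd

# Hypothetical CSV text that mixes several "missing" markers
raw = io.StringIO(
    "PID,NUM_BEDROOMS,NUM_BATH\n"
    "1,3,1\n"
    "2,n/a,?\n"
    "3,-,2\n"
    "4,NA,-999\n"
)

# Default parsing: only pandas' built-in markers (here "n/a" and "NA") become NaN
default = pd.read_csv(raw)

# Custom parsing: additionally treat "?", "-" and -999 as missing
raw.seek(0)
custom = pd.read_csv(raw, na_values=["?", "-", -999])

print(default.isnull().sum())
print(custom.isnull().sum())
```

Odd but syntactically valid entries (such as a text value in a numeric column) are not caught this way and still need a separate check, for example on the column's dtype or range.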

Types of Missing Values¶

Missing completely at random (MCAR)¶

The data is missing purely at random; the missingness is not related to any variable.

Missing at random (MAR)¶

Whether a value is missing depends only on other, observed variables, not on the missing value itself. For example, people with high anxiety (x) tend not to report their income (y): the chance that y is missing depends on the value of x, but for a given x the magnitude of the missing y is still random.

Missing not at random (MNAR)¶

Whether a value of a variable y is missing depends on y itself, so the missing values are not randomly distributed. For example, people with low incomes tend not to report their income. This type of missing value is relatively the hardest to handle.

Under MCAR and MAR, we may either drop the rows with missing values or impute them. Under MNAR, however, dropping rows with missing values biases the data, and imputation does not always give good results either.
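
The three mechanisms can be made concrete with a small simulation. This is only a sketch under made-up assumptions: the population model (`y = 50 + 10x + noise`), the missingness probabilities, and the sample size are all hypothetical. For each mechanism, the rows with missing `y` are dropped and the regression of `y` on `x` is refitted on the remaining rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: x is always observed, y (income) may go missing
x = rng.normal(size=n)
y = 50 + 10 * x + rng.normal(scale=5, size=n)

# Probability that y is missing under each mechanism
p_missing = {
    "MCAR": np.full(n, 0.3),                  # unrelated to x and y
    "MAR": 1 / (1 + np.exp(-2 * x)),          # depends only on the observed x
    "MNAR": 1 / (1 + np.exp((y - 50) / 5)),   # depends on y itself (low y -> missing)
}

fits = {}
for name, p in p_missing.items():
    observed = rng.random(n) >= p
    # Refit y = slope * x + intercept using the complete rows only
    slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
    fits[name] = (slope, intercept)
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}")
```

In this setup the fits under MCAR and MAR stay close to the true slope (10) and intercept (50), while under MNAR the intercept is pushed upward and the slope is flattened, because low values of `y` are systematically removed.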

Handling Missing Values¶

Having seen what missing values are, how they are usually written, and what types exist, we now look at how to handle them.

Image source: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4¶

Note that no single method is truly the best for handling missing values; the appropriate method depends on the data type and on the problem being studied.

Avoiding data with missing values¶

That is, dropping the records that contain missing values, or dropping variables that have a great many missing values.

There are several ways to do this deletion:

  1. Listwise deletion: drop every row that has one or more missing values.
  2. Pairwise deletion: drop rows only when a value is missing in the variables actually being used. For example, to compute the correlation between glucose_conc and diastolic_bp, we only need to drop the rows where either of those two variables is missing.
  3. Dropping variables: discard a variable if a very large share of its column is missing, for example close to 50%.
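
The three deletion strategies above can be sketched in pandas as follows. The data values are made up; only the column names `glucose_conc` and `diastolic_bp` come from the example in the text, and the 0.3 cut-off for dropping a column is an arbitrary choice for this toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the column names from the example above
df = pd.DataFrame({
    "glucose_conc": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    "diastolic_bp": [72.0, 66.0, 64.0, np.nan, 40.0, 74.0],
    "bmi":          [33.6, np.nan, 23.3, 28.1, np.nan, 25.6],
})

# Listwise deletion: drop every row that has at least one missing value
listwise = df.dropna()

# Pairwise deletion: only the two columns being analysed must be complete
pairwise = df.dropna(subset=["glucose_conc", "diastolic_bp"])

# Dropping a variable: remove columns whose missing fraction is too high
too_sparse = df.columns[df.isnull().mean() > 0.3]
reduced = df.drop(columns=too_sparse)

print(len(listwise), len(pairwise), list(reduced.columns))
```

Pairwise deletion keeps more rows than listwise deletion (4 versus 2 here). Note that `DataFrame.corr()` already excludes missing values pair by pair, so it behaves like pairwise deletion without any explicit `dropna`.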

Ignoring missing values¶

Some machine learning algorithms and other analysis methods can handle missing values on their own. For example, some implementations of decision trees, k-Nearest Neighbors (kNN), and Gradient Boosting Machines (GBM) can ignore missing values, while XGBoost learns at each split which branch the rows with missing values should follow.

Likewise, if a column carries no useful information, we can simply leave its missing values in place. An example is the ticket number in flight data: there is no need to work out how to impute such a column.

Imputation¶

We can replace a missing value with some other value. There are several methods for imputing missing values.

• Univariate Imputation¶

Imputation with the median / mean / mode¶

Median or mean imputation is used for numerical data: the missing values in a column are replaced by the median or mean of the non-missing values. Mode imputation is used for categorical data.

(Note: if the distribution is strongly skewed (to the right or left) or contains extreme values, the median is recommended over the mean.)

Alternatively, we can impute separately within the groups of a categorical variable: for example, missing values of diabetes patients are imputed with the mean of the diabetes patients, and those of non-diabetic patients with the mean of the non-diabetic patients.
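
A minimal sketch of these univariate options in pandas is shown below. The small table and its column names (`diabetes`, `glucose`, `smoker`) are hypothetical and only serve to make each step visible.

```python
import numpy as np
import pandas as pd

# Hypothetical patient data with gaps in a numerical and a categorical column
df = pd.DataFrame({
    "diabetes": ["yes", "yes", "yes", "no", "no", "no"],
    "glucose":  [150.0, np.nan, 170.0, 90.0, 100.0, np.nan],
    "smoker":   ["no", "no", None, "yes", "no", "no"],
})

# Numerical column: overall mean or median of the observed values
mean_imputed = df["glucose"].fillna(df["glucose"].mean())
median_imputed = df["glucose"].fillna(df["glucose"].median())

# Categorical column: the mode (most frequent category)
mode_imputed = df["smoker"].fillna(df["smoker"].mode()[0])

# Group-wise imputation: fill with the mean of the same diabetes group
group_mean = df.groupby("diabetes")["glucose"].transform("mean")
group_imputed = df["glucose"].fillna(group_mean)

print(group_imputed.tolist())
```

The group-wise version fills the gap of the diabetic patient with 160 and that of the non-diabetic patient with 95, instead of using the single overall mean of 127.5 for both.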

• Multivariate Imputation¶

Single Imputation¶

The missing values can be predicted with supervised learning methods, such as kNN or linear regression for numerical data and logistic regression for categorical data.
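
As one possible sketch of this idea, the snippet below imputes a numerical variable by simple linear regression on another, fully observed variable, using only NumPy. The names `carpet` and `builtup` echo two columns of the house price data, but the values here are made up, and a real analysis would usually use more predictors and a proper modelling library.

```python
import numpy as np

# Hypothetical data: carpet area is complete, built-up area has gaps
carpet = np.array([1000.0, 1200.0, 1400.0, 1600.0, 1800.0, 2000.0])
builtup = np.array([1200.0, 1440.0, np.nan, 1920.0, np.nan, 2400.0])

observed = ~np.isnan(builtup)

# Fit builtup = slope * carpet + intercept on the complete rows only
slope, intercept = np.polyfit(carpet[observed], builtup[observed], deg=1)

# Replace each missing value with the model's prediction
imputed = builtup.copy()
imputed[~observed] = slope * carpet[~observed] + intercept

print(np.round(imputed, 1))
```

Because every imputed value lies exactly on the fitted line, this kind of single imputation understates the natural variability of the data, which is worth remembering when the imputed column is later used in a model.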

Other Cases¶

For categorical data, one option is to treat missing values as a level (category) of their own.

For time series data, missing values can be imputed by:

  • filling a missing value with the previous non-missing value, often called Last Observation Carried Forward (LOCF), or with the next non-missing value, often called Next Observation Carried Backward (NOCB)

  • using linear interpolation

  • using linear interpolation that takes the seasonal trend into account
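
The first two time series options, and the "own level" option for categorical data, can be sketched in pandas as follows. The series values, the dates, and the label "Missing" are assumptions for illustration; seasonally adjusted interpolation is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps
s = pd.Series(
    [10.0, np.nan, np.nan, 16.0, np.nan, 20.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

locf = s.ffill()          # Last Observation Carried Forward
nocb = s.bfill()          # Next Observation Carried Backward
linear = s.interpolate()  # linear interpolation between the neighbouring values

# Categorical data: keep "missing" as a level of its own
parking = pd.Series(["Open", None, "Covered", None])
parking_filled = parking.fillna("Missing")

print(pd.DataFrame({"raw": s, "LOCF": locf, "NOCB": nocb, "linear": linear}))
```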

Missing Values¶

In [55]:
# General Look at the Missing Values
print(price2.isnull().sum())
Dist_Taxi        13
Dist_Market      13
Dist_Hospital     1
Carpet            8
Builtup          15
Parking           0
City_Category     0
Rainfall          0
House_Price       0
dtype: int64
In [56]:
set(price2['Parking'])
Out[56]:
{'Covered', 'No Parking', 'Not Provided', 'Open'}

A Better Overview of Missing Values, Especially in Big Data¶

In [57]:
sns.heatmap(price2.isnull(), cbar=False)
plt.title('Heatmap Missing Value')
plt.show()
In [58]:
(price2.isnull().sum()/len(price2)).to_frame('missing fraction')
Out[58]:
missing fraction
Dist_Taxi 0.013963
Dist_Market 0.013963
Dist_Hospital 0.001074
Carpet 0.008593
Builtup 0.016112
Parking 0.000000
City_Category 0.000000
Rainfall 0.000000
House_Price 0.000000

Imputing Missing Values¶

In [59]:
print(price.isnull().sum())
price.head()
Dist_Taxi        13
Dist_Market      13
Dist_Hospital     1
Carpet            8
Builtup          15
Parking           0
City_Category     0
Rainfall          0
House_Price       0
dtype: int64
Out[59]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
0 9796.0 5250.0 10703.0 1659.0 1961.0 Open CAT B 530 6649000
1 8294.0 8186.0 12694.0 1461.0 1752.0 Not Provided CAT B 210 3982000
2 11001.0 14399.0 16991.0 1340.0 1609.0 Not Provided CAT A 720 5401000
3 8301.0 11188.0 12289.0 1451.0 1748.0 Covered CAT B 620 5373000
4 10510.0 12629.0 13921.0 1770.0 2111.0 Not Provided CAT B 450 4662000
In [60]:
price.info()
<class 'pandas.core.frame.DataFrame'>
Index: 932 entries, 0 to 931
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Dist_Taxi      919 non-null    float64 
 1   Dist_Market    919 non-null    float64 
 2   Dist_Hospital  931 non-null    float64 
 3   Carpet         924 non-null    float64 
 4   Builtup        917 non-null    float64 
 5   Parking        932 non-null    category
 6   City_Category  932 non-null    category
 7   Rainfall       932 non-null    int64   
 8   House_Price    932 non-null    int64   
dtypes: category(2), float64(5), int64(2)
memory usage: 60.4 KB
In [61]:
price["Builtup"].fillna(price["Builtup"].mean()) # Careful: inplace=True is deliberately not used, so price itself is unchanged
Out[61]:
0      1961.0
1      1752.0
2      1609.0
3      1748.0
4      2111.0
        ...  
927    1910.0
928    1663.0
929    1436.0
930    1560.0
931    1429.0
Name: Builtup, Length: 932, dtype: float64

Learn more here:¶

https://towardsdatascience.com/imputing-missing-data-with-simple-and-advanced-techniques-f5c7b157fb87¶

Exclude Missing Values¶

In [62]:
# Simplest solution, if there are not many missing values (MV):
# drop the rows that contain them. There are several ways to do this.
X = price.dropna() # drop a row if it has at least one MV in any column
price2.dropna(how='all') # drop a row only if every column is missing
price2.dropna(thresh=2) # keep only rows with at least 2 non-missing values
price2.dropna(subset=['Dist_Hospital'])[:7] # drop rows where Dist_Hospital is missing
# use inplace=True only if you are really sure
price2.dropna(inplace=True)
In [63]:
print(price2.isnull().sum())
Dist_Taxi        0
Dist_Market      0
Dist_Hospital    0
Carpet           0
Builtup          0
Parking          0
City_Category    0
Rainfall         0
House_Price      0
dtype: int64
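
Since the `dropna` options are easy to mix up, here is a small sketch of their effect on a made-up four-row table (the frame `demo` and its columns are hypothetical). Note in particular that `thresh` counts the non-missing values a row must have in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 0 is complete, row 3 is entirely missing
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, 8.0, np.nan],
})

n_any = len(demo.dropna())                # keeps rows without any missing value
n_all = len(demo.dropna(how="all"))       # drops only rows where every column is missing
n_thresh = len(demo.dropna(thresh=2))     # keeps rows with at least 2 non-missing values
n_subset = len(demo.dropna(subset=["b"])) # only column "b" has to be present

print(n_any, n_all, n_thresh, n_subset)
```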

Saving (preprocessed) Data¶

In [64]:
# Saving the preprocessed Data for future use/analysis
price2.to_csv("data/price_PreProcessed.csv", encoding='utf8', index=False)

Note: next week's case study also requires:¶

https://pandas.pydata.org/docs/user_guide/merging.html¶

Pendahuluan Visualisasi ¶

  • After data preprocessing, visualization can be used to:
    • Find out whether further preprocessing is needed.
    • Get basic information/insight from the data.
    • Form hypotheses/conjectures to test with a model in the next stage.
  • Later, visualization is also used to report model performance/prediction results.
  • Examples of (basic/generic) visualization goals: monitoring systems, tracking (KPIs/statistics), telling stories, showing outliers/trends, supporting arguments, or simply giving an overview of the data (e.g. Kibana).

Python Visualization modules Map

In [65]:
# this module needs a few additional packages
# If you run this Jupyter notebook locally, some adjustments are needed
try:
    import google.colab; IN_COLAB = True
    !pip install statsmodels folium chart_studio plotly
except ImportError:
    print('If you have not done so, please install the modules statsmodels folium chart_studio plotly from your Python environment terminal (recommended).') #IN_COLAB = False
If you have not done so, please install the modules statsmodels folium chart_studio plotly from your Python environment terminal (recommended).
In [66]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
import matplotlib.cm as cm
import calendar, folium
from folium.plugins import HeatMap
from collections import Counter
from statsmodels.graphics.mosaicplot import mosaic
plt.style.use('bmh'); sns.set()

Do house prices tend to differ by parking type?¶

In [67]:
p= sns.catplot(x="Parking", y="House_Price", data=price2)
# What can we see from this result?

Add a dimension to the visualization to get clearer/better insight¶

In [68]:
# We can also plot information from 3 variables at once
# (to look for possible interaction effects)
p= sns.catplot(x="Parking", y="House_Price", hue="City_Category", kind="swarm", data=price2)

What information can we draw from the result above?¶

1D Visualization: Bar Chart / Count Plot¶

Image Source: https://datavizcatalogue.com/methods/bar_chart.html

Careful: Bar Chart vs Histogram¶

Image Source: https://www.mathsisfun.com/data/bar-graphs.html
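
The difference is easiest to see in the numbers that each chart draws. In the sketch below (made-up data, hypothetical variable names), a bar chart is built from counts per category, while a histogram is built from counts per numerical interval.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical and numerical samples
parking = pd.Series(["Open", "Covered", "Open", "No Parking", "Open", "Covered"])
prices = np.array([3.2, 4.1, 4.8, 5.5, 5.9, 6.3, 7.7, 9.4])  # in millions

# Bar chart input: one count per category; the order of the bars carries no meaning
bar_heights = parking.value_counts()

# Histogram input: counts per interval; the bins are ordered and adjacent
hist_counts, bin_edges = np.histogram(prices, bins=[2, 4, 6, 8, 10])

print(bar_heights)
print(hist_counts, bin_edges)
```

Reordering the categories of a bar chart changes nothing about its meaning, whereas the bins of a histogram lie on a numerical axis, and changing their width changes the shape of the picture.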

In [69]:
plt.figure(figsize=(8,6)) # https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
p = sns.countplot(x="City_Category", hue="Parking", data=price2)

Horizontal? Why?¶

In [70]:
ax = sns.countplot(y = 'Parking', hue = 'City_Category', palette = 'muted', data=price2)
In [71]:
# "SubPlot" demo, using a different dataset because the price data only has 2 categorical variables.

tips=sns.load_dataset('tips') # Built-in dataset from the Seaborn module ... explained further below.
categorical = tips.select_dtypes(include = ['category']).columns

fig, ax = plt.subplots(2, 2, figsize=(12, 6))
for variable, subplot in zip(categorical, ax.flatten()):
    sns.countplot(tips, x=variable, ax=subplot)
In [72]:
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB

Adding labels? ... Hhhmmm...¶

In [73]:
X = price2[price2["Parking"].isin(["Open","Covered"])]
X = X[X["House_Price"]<7000000]
X.groupby(["Parking", "City_Category"]).size().unstack()
Out[73]:
City_Category CAT A CAT B CAT C
Parking
Covered 18 48 47
No Parking 0 0 0
Not Provided 0 0 0
Open 35 132 88
In [74]:
def groupedbarplot(df, width=0.8, annotate="values", ax=None, **kw):
    """Draw a grouped bar chart: one group per row of df, one bar per column.

    annotate="values" labels each bar with its height, any other truthy
    value labels it with the column name, and a falsy value disables labels.
    """
    ax = ax or plt.gca()
    n = len(df.columns)
    w = 1./n
    # Offsets of the n bars within a group, centred on the group's tick
    pos = (np.linspace(w/2., 1-w/2., n)-0.5)*width
    w *= width
    bars = []
    for col, x in zip(df.columns, pos):
        bars.append(ax.bar(np.arange(len(df))+x, df[col].values, width=w, **kw))
        for val, xi in zip(df[col].values, np.arange(len(df))+x):
            if annotate:
                txt = val if annotate == "values" else col
                ax.annotate(txt, xy=(xi, val), xytext=(0,2), 
                            textcoords="offset points",
                            ha="center", va="bottom")
    ax.set_xticks(np.arange(len(df)))
    ax.set_xticklabels(df.index)
    return bars
In [75]:
counts = price2.groupby(["Parking", "City_Category"]).size().unstack()
plt.figure(figsize=(12,8))
groupedbarplot(counts)
plt.show()

Stacked/Segmented Chart¶

In [76]:
CT = pd.crosstab(index=price2["City_Category"], columns=price2["Parking"])
p = CT.plot(kind="bar", figsize=(8,8), stacked=True)
In [77]:
# do this if you want to save the plot to a file
p.figure.savefig('barChart.png')
# a new file will appear in the folder that contains the ipynb.

Mosaic Plot for multiple categorical data analysis¶

In [78]:
p = mosaic(tips, ['sex','smoker','time'])

Pie Chart¶

Image Source: https://datavizcatalogue.com/methods/pie_chart.html

In [79]:
# PieChart
plot = price2.City_Category.value_counts().plot(kind='pie')

Show Values?¶

In [80]:
data = price2['Parking']

proportion = Counter(data)  # number of houses per parking type
values = [float(v) for v in proportion.values()]
colors = ['r', 'g', 'b', 'y']
labels = proportion.keys()
explode = (0.1, 0, 0, 0)  # pull the first slice out slightly
# Slices are labelled with their counts; the legend maps colors to parking types
plt.pie(values, colors=colors, labels= values, explode=explode, shadow=True)
plt.title('Parking Type Proportions')
plt.legend(labels, loc='best')
plt.show()

Box Plot¶

  • Lower Extreme: $Q_1 - 1.5(Q_3-Q_1)$; Upper Extreme: $Q_3 + 1.5(Q_3-Q_1)$
  • Source: https://datavizcatalogue.com/methods/box_plot.html & https://lsc.deployopex.com/box-plot-with-jmp/
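
The two formulas above can be checked numerically. The sketch below uses a made-up sample `x` and assumes the common drawing convention (the one described for the `whis` parameter of seaborn's `boxplot`), in which each whisker ends at the most extreme data point that still lies within 1.5 IQR of the box rather than at the computed limit itself.

```python
import numpy as np

# Hypothetical sample with one large value
x = np.array([2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 25.0])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr   # "Lower Extreme" formula
upper_limit = q3 + 1.5 * iqr   # "Upper Extreme" formula

# Whiskers as usually drawn: the most extreme points still inside the limits
inside = x[(x >= lower_limit) & (x <= upper_limit)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Points beyond the limits are drawn individually as outliers
outliers = x[(x < lower_limit) | (x > upper_limit)]

print(lower_limit, upper_limit, lower_whisker, upper_whisker, outliers)
```

Here the limits are -1 and 15, but the whiskers are drawn at 2 and 10, so whisker positions read off a plot should not be mistaken for the limits used to flag outliers.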
In [81]:
# With outliers present the plot becomes hard to read (data = price, not price2)
p = sns.boxplot(x="House_Price", y="Parking", data=price)
In [82]:
# BoxPlots
p = sns.boxplot(x="House_Price", y="Parking", data=price2)
# What does the pattern revealed by this box plot mean?

How do we retrieve the outlier records themselves?¶

  • Be careful about the difference between iloc and loc on a DataFrame.
  • Be careful with the outlier formula used by the Seaborn box plot!
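
The first warning matters as soon as rows have been filtered out, because the remaining rows keep their original index labels. A minimal sketch with a made-up four-row frame (the values and the name `expensive` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"House_Price": [5.0, 9.0, 4.0, 8.0]})  # hypothetical values

# Filtering keeps the ORIGINAL index labels (here 1 and 3)
expensive = df[df["House_Price"] > 6]

by_label = expensive.loc[1, "House_Price"]  # row whose index label is 1
by_position = expensive.iloc[1, 0]          # second row of the filtered result

print(list(expensive.index), by_label, by_position)
```

`loc` selects by label and therefore also accepts a boolean mask aligned on the index, whereas `iloc` selects purely by position, so the same number can point at different rows once the index is no longer 0, 1, 2, and so on.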
In [83]:
Q1 = price2['House_Price'].quantile(0.25)
Q3 = price2['House_Price'].quantile(0.75)
IQR = Q3 - Q1 #IQR is interquartile range. 
print("Q1={}, Q3={}, IQR={}".format(Q1, Q3, IQR))

outliers_ = (price2['House_Price'] < (Q1 - 1.5 *IQR)) # Lower outliers
rumah_potensial = price2.loc[outliers_]
rumah_potensial
Q1=4638000.0, Q3=7183000.0, IQR=2545000.0
Out[83]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price

Box plots can also be split by a category¶

In [84]:
p = sns.catplot(x="Parking", y="House_Price", hue="City_Category", kind="box", data=price2)
  • Is there any (new) conjecture/interpretation from the box plot above?
  • What are the weaknesses (pitfalls) of the box plot?

Swarm Plot & Violin Plot¶

Addressing the weaknesses of the box plot.¶

In [85]:
p= sns.catplot(x="day", y="total_bill", hue="sex", kind="swarm", data=tips)
In [86]:
p = sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')

Histogram¶

Image Source: https://datavizcatalogue.com/methods/histogram.html

In [87]:
numerical = price2.select_dtypes(include = ['int64','float64']).columns

price2[numerical].hist(figsize=(15, 6), layout=(2, 4));

Scatter Plot¶

Image Source: https://datavizcatalogue.com/methods/scatterplot.html

In [88]:
p = sns.scatterplot(x=price2['House_Price'], y=price2['Dist_Market'], hue = price2['Parking'])

Bigger picture?¶

In [89]:
fig, ax = plt.subplots(1, 1, figsize=(12,8))

p = sns.scatterplot(x=price2['House_Price'], y=price2['Dist_Market'], hue = price2['Parking'], ax=ax)

Joint Plot¶

In [90]:
p = sns.jointplot(x=price2['House_Price'], y=price2['Rainfall'], hue = price2['Parking'])

Conditional Plot¶

In [91]:
cond_plot = sns.FacetGrid(data=price2, col='Parking', hue='City_Category')#, hue_order=["Yes", "No"]
p = cond_plot.map(sns.scatterplot, 'Dist_Hospital', 'House_Price').add_legend()

Pairwise Plot¶

In [92]:
# Let us look at just a subset of the variables first and group them by "Parking"
p = sns.pairplot(price2[['House_Price','Builtup','Dist_Hospital','Parking']], hue="Parking")
# Any interesting patterns?

Checking Correlations¶

In [93]:
price2.select_dtypes(include=np.number).corr()
Out[93]:
Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
Dist_Taxi 1.000000 0.453479 0.795520 0.008703 0.008230 0.013540 0.103393
Dist_Market 0.453479 1.000000 0.621466 -0.020778 -0.020384 0.069806 0.116795
Dist_Hospital 0.795520 0.621466 1.000000 0.011706 0.011960 0.046826 0.131799
Carpet 0.008703 -0.020778 0.011706 1.000000 0.998885 -0.043485 0.096229
Builtup 0.008230 -0.020384 0.011960 0.998885 1.000000 -0.043424 0.097417
Rainfall 0.013540 0.069806 0.046826 -0.043485 -0.043424 1.000000 0.014383
House_Price 0.103393 0.116795 0.131799 0.096229 0.097417 0.014383 1.000000
In [94]:
# Heatmap to investigate correlations
corr2 = price2.select_dtypes(include=np.number).corr()
plt.figure(figsize=(12, 10))
# Show only the stronger correlations (>= 0.5 or <= -0.4); other cells are left blank
sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 14}, square=True);
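
The filtering used for the heatmap can also be inspected without a plot. The sketch below builds a small synthetic table (the generated values and the noise levels are assumptions; only the column names echo the house price data), applies the same boolean mask to the correlation matrix, and then lists each strongly correlated pair once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
carpet = rng.normal(1500, 250, n)
demo = pd.DataFrame({
    "Carpet": carpet,
    "Builtup": 1.2 * carpet + rng.normal(0, 10, n),  # nearly a rescaled copy
    "Rainfall": rng.normal(800, 250, n),             # unrelated noise
})

corr = demo.corr()

# Boolean-mask indexing keeps the shape; cells failing the condition become NaN
strong = corr[(corr >= 0.5) | (corr <= -0.4)]

# List each strongly correlated pair once (upper triangle, diagonal excluded)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = strong.where(upper).stack().dropna()

print(pairs)
```

A pair of variables that is almost perfectly correlated, like the two area columns here, carries nearly the same information twice, which is worth noting before both are fed into a model.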

Visual Python¶

https://visualpython.ai/¶


Visualization Design¶

Some Additional Notes¶

  • Design is not mandatory in the flipped class, but it can earn extra credit (a plus).
  • Visualizations may be made with Excel, Tableau, or other software, but the resulting image must be displayed in the Jupyter notebook (as PNG/JPEG).
  • The preprocessing report is about data quality.
  • Do not forget that interpretation and recommendations are mandatory.
  • Be careful with the narrative when interpreting EDA results; try to avoid strong (overly confident) statements.

End of Module¶